CIS Seminar: “Rater Equivalence: An Interpretable Measure of Classifier Accuracy Against Human Labels”
October 20, 2022 at 3:30 PM - 4:30 PM
In many classification tasks, the ground truth is either noisy or subjective. Examples of noisy ground truth include: does this radiology image show a cancerous growth? does this radar data portend an imminent tornado? Examples of subjective ground truth include: which of two alternative paper titles is better? is this comment toxic? what is the political leaning of this news article? We refer to tasks where human labels are the only indication of ground truth available at the time that decisions must be made as survey settings. In these settings, measures of classifier accuracy against human labels, such as precision, recall, and cross-entropy, confound the quality of the classifier with the level of agreement among human raters. Thus, they have no meaningful interpretation on their own. We describe a procedure that, given a dataset with predictions from a classifier and K labels per item, rescales any underlying accuracy measure into one that has an intuitive interpretation. The K raters are divided into a source panel and a target panel. The source panel’s labels for an item are combined to produce a predicted label for another rater. Both the source panel predictions and classifier predictions are scored against the same target panel’s labels. The rater equivalence of any classifier is the minimum number of source raters needed to produce the same expected score as that found for the classifier. We explore the stability of the rater equivalence measure as the target panel size varies and find one underlying measure, determinant mutual information, for which it is invariant.
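The procedure described in the abstract can be illustrated with a small sketch. This is not the speakers' implementation; it makes several simplifying assumptions that do not come from the abstract: labels are binary (0/1), the source panel's labels are combined by majority vote (ties scored as a coin flip), the underlying accuracy measure is plain agreement, the target panel is a single held-out rater, and the result is reported as a whole number of raters with no interpolation. The function names `classifier_score`, `panel_score`, and `rater_equivalence` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def classifier_score(preds, labels):
    """Expected agreement between the classifier and one target rater.

    preds:  list of 0/1 predictions, one per item.
    labels: list of rows, each row holding the K human labels (0/1) for an item.
    """
    return mean(
        1.0 if pred == target else 0.0
        for pred, row in zip(preds, labels)
        for target in row
    )


def panel_score(labels, k):
    """Expected agreement between a k-rater majority vote and a held-out rater.

    Assumption: every size-k subset of the remaining raters is an equally
    likely source panel, and a tied vote is scored as a coin flip (0.5).
    """
    scores = []
    for row in labels:
        row = list(row)
        for t, target in enumerate(row):
            others = row[:t] + row[t + 1:]  # hold out the target rater
            for panel in combinations(others, k):
                ones = sum(panel)
                if 2 * ones == k:
                    scores.append(0.5)
                else:
                    vote = 1 if 2 * ones > k else 0
                    scores.append(1.0 if vote == target else 0.0)
    return mean(scores)


def rater_equivalence(preds, labels):
    """Smallest source-panel size whose score matches or beats the classifier.

    Returns None when no panel of size 1..K-1 reaches the classifier's score.
    """
    k_total = len(labels[0])
    target_score = classifier_score(preds, labels)
    for k in range(1, k_total):
        if panel_score(labels, k) >= target_score:
            return k
    return None
```

Calling `rater_equivalence(preds, labels)` with one prediction per item and a fixed number of labels per item returns the panel size under these assumptions. Swapping in a different scoring function or combiner would change only the two scoring helpers.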

