In both machine learning and medicine, disagreement among human labellers poses a significant challenge. In the healthcare domain, this label disagreement becomes a full-fledged clinical problem: the individual 'labellers' are highly trained human experts, doctors, and the labels correspond to diagnoses. Despite this domain expertise, the disagreement issue remains fundamental. We call this clinical problem the medical second opinion problem. A patient case is typically seen by one expert, but some cases naturally give rise to significant variation in expert diagnoses. Because this disagreement arises from human judgement and bias, the instance (patient case) itself contains features that may induce an intrinsic level of disagreement in human evaluation, distinct from uniform noise or from noise that depends solely on the ground truth label (diagnosis).

Figure 1: Average disagreement of three sampled grades on our data. For each retinal image x, we sample 3 grades given by doctors on x. Each grade takes a value from 1 (None, i.e., a normal image) to 5 (Proliferative, i.e., severe disease). All three doctors agree fully on just over 45% of the images. About 30% have a single disagreement of magnitude 1: e.g., the grades might be 2, 2, 3, or perhaps 2, 1, 2. And around 25% of the images have a disagreement greater than this: either two doctors disagree by more than one grade (e.g., grades 2 and 4), or all three doctor grades differ, or both.
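The three buckets in Figure 1 can be computed mechanically from a triple of grades. A minimal sketch (the function name and bucket labels are ours, not the paper's):

```python
def disagreement_category(grades):
    """Bucket a triple of DR grades (each 1-5) into the three Figure 1
    categories: full agreement, a single magnitude-1 disagreement,
    or a larger disagreement."""
    g = sorted(grades)
    if g[0] == g[2]:              # all three grades identical
        return "full agreement"
    if g[2] - g[0] == 1:          # e.g. 2,2,3 or 2,1,2
        return "single disagreement of magnitude 1"
    return "larger disagreement"  # e.g. 2,2,4 or 1,2,3

print(disagreement_category([2, 2, 3]))  # single disagreement of magnitude 1
print(disagreement_category([2, 4, 2]))  # larger disagreement
```

Note that for integer grades, a spread of exactly one grade forces two of the three doctors to agree, so checking the spread is enough to identify the middle bucket.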

This motivates applying machine learning to predict which cases will give rise to the greatest expert disagreement. A first approach is to train a classifier on the data as is standard (perhaps on histogram labels) and use the classifier's output probabilities (e.g., their entropy or variance) as a measure of uncertainty. However, the features that cause greater uncertainty (distractor artefacts, obscured image attributes) may be completely unrelated to the classification problem.
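For concreteness, one common choice of two-step uncertainty score is the entropy of the classifier's predicted grade distribution. A sketch (the clipping constant is our implementation detail):

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of a classifier's output distribution over the 5 DR grades;
    higher entropy means a more uncertain prediction."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

print(predictive_entropy([0.96, 0.01, 0.01, 0.01, 0.01]))  # low: confident
print(predictive_entropy([0.2, 0.2, 0.2, 0.2, 0.2]))       # high: log(5)
```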

Therefore, we look at direct uncertainty prediction: training a model to predict a 0/1 low/high disagreement value directly from the input. We can picture the difference between these methods with the following diagram:
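Direct uncertainty prediction needs a 0/1 disagreement target per image. One simple construction from the available grades (our simplification for illustration; the target can be defined in several ways) is:

```python
def disagreement_target(grades):
    """0/1 target for direct uncertainty prediction: 1 (high disagreement)
    if the doctors' grades for this image are not all identical, else 0."""
    return int(max(grades) != min(grades))

print(disagreement_target([2, 2, 2]))  # 0: low disagreement
print(disagreement_target([2, 4, 2]))  # 1: high disagreement
```

A model is then trained to predict this target directly from the image, with no intermediate grade classifier.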

We hypothesize that under a couple of natural conditions:

  1. The data has features that are indicative of uncertainty (but not necessarily of the label)
  2. The number of noisy labels per datapoint is small (2-3)

direct uncertainty prediction will outperform two-step uncertainty via classification. We first test this on a synthetic task where we can precisely control the data parameters. We draw data from a mixture of several Gaussians and generate 2-3 noisy labels for each datapoint based on the probability of it coming from the different mixture centers. The mixture centers form two clusters, one of high variance and one of low variance. This setting satisfies both of the conditions above, and we find that direct uncertainty prediction does indeed perform better.
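A minimal sketch of such a synthetic task (the dimensionality, centers, and variances here are illustrative choices, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture centers in two clusters: one tight (low variance), one diffuse.
centers = np.array([[-4.0, 0.0], [-4.5, 0.5], [4.0, 0.0], [4.5, 0.5]])
sigmas = np.array([0.3, 0.3, 2.0, 2.0])

def sample_point_and_labels(n_labels=3):
    """Draw x from the mixture, then draw n_labels noisy labels from the
    posterior over mixture components given x (uniform prior)."""
    k = rng.integers(len(centers))
    x = rng.normal(centers[k], sigmas[k])
    # Log-density of x under each isotropic component (constants dropped).
    logp = (-np.sum((x - centers) ** 2, axis=1) / (2 * sigmas**2)
            - len(x) * np.log(sigmas))
    p = np.exp(logp - logp.max())
    p /= p.sum()
    labels = rng.choice(len(centers), size=n_labels, p=p)
    return x, labels

x, labels = sample_point_and_labels()
# Points near the diffuse cluster tend to receive more mixed labels.
```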

Next, we turn to our main application: healthcare data, specifically retinal images. These images can be used to diagnose a patient with Diabetic Retinopathy (DR). DR is graded on a five-class scale: None, Mild, Moderate, Severe, Proliferative, corresponding to grades 1 to 5. Of particular clinical interest is the threshold at grade 3, which divides cases into non-referable/referable DR and may put patients on very different clinical pathways. Our data consists of a relatively large dataset T, in which each image has only a few noisy labels. This forms our train set and a first holdout (test) set.
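The clinically important binarization at grade 3 is simply:

```python
def is_referable(grade):
    """DR grades 1-2 (None, Mild) are non-referable;
    grades 3-5 (Moderate, Severe, Proliferative) are referable."""
    return grade >= 3

print([g for g in range(1, 6) if is_referable(g)])  # [3, 4, 5]
```

Disagreements that cross this threshold are the ones most likely to change a patient's clinical pathway.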

We also have a very small gold-standard adjudicated dataset A which, in addition to several individual doctor grades (this time from specialists), contains a single label obtained via active discussion between several doctors: an adjudicated label. This dataset lets us test the clinical question of greatest interest:

Can models (trained on noisy data) correctly identify cases where an individual doctor is most likely to disagree with an unknown ground truth condition?

We test this in our first evaluation on the adjudicated set: identifying images in A where various averages of the individual doctor labels differ from the adjudicated label. We also apply a ranking test to see whether the models can correctly rank images for which the adjudicated label is "furthest away" from the entire distribution of individual doctor grades. In this setting, we can also directly compare the model's ranking to rankings built from subsets of doctors. We find that the model-induced ranking lies between 5 and 6 doctor grades: the model ranks better than using 5 doctor grades, but worse than using 6. Finally, we test how budgeting doctor labelling effort according to the model's ranking improves agreement with the adjudicated label.
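One way to picture the ranking test: score each image by how far the adjudicated label sits from the doctor-grade distribution, then ask whether the model's ranking recovers that ordering. A sketch, where the distance measure and top-k overlap are our stand-ins for the paper's exact metrics:

```python
import numpy as np

def adjudicated_distance(adjudicated, doctor_grades):
    """Mean absolute distance between the adjudicated label and the
    individual doctor grades for one image."""
    return float(np.mean(np.abs(np.asarray(doctor_grades) - adjudicated)))

def topk_overlap(model_scores, true_distances, frac=0.1):
    """Fraction of the truly 'hardest' images (largest distance) that the
    model's ranking also places in its top fraction."""
    n = len(true_distances)
    k = max(1, int(frac * n))
    top_model = set(np.argsort(model_scores)[-k:])
    top_true = set(np.argsort(true_distances)[-k:])
    return len(top_model & top_true) / k
```

A ranking built from a subset of doctors can be scored the same way, by replacing the model scores with that subset's disagreement statistic.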

We evaluate each of these tests in multiple ways and, in general, find that direct uncertainty prediction convincingly outperforms two-step uncertainty via classification.

Check out our paper, Direct Uncertainty Prediction for Medical Second Opinions, for more details!

Acknowledgements: Thanks to Jon Kleinberg, Ziad Obermeyer and Sendhil Mullainathan for their feedback on this post.