Machine Learning (particularly Deep Learning) has seen a dramatic increase in the diversity of problems it is applied to, from speech recognition to machine translation to sensitive domains such as healthcare. A common theme in all of these use cases is the need for a large, accurately labelled dataset. This dataset is typically collected from human annotators, who range from novices (with no task specialization) to highly trained experts, such as the doctors who diagnose complex medical conditions.

Figure 1: Disagreement among three doctor grades sampled per image. For each retinal image x, we sample 3 grades given by doctors on x. Each grade takes a value from 1 (None, i.e. a normal image) to 5 (Proliferative, i.e. severe disease). All three doctors agree fully on just over 45% of the images. But about 30% have a single disagreement of magnitude 1: e.g. the grades might be 2, 2, 3, or perhaps 2, 1, 2. And around 25% of the images have a disagreement greater than this: either two doctors disagree by more than one grade (e.g. grades 2 and 4), or all three doctor grades disagree, or both.
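
To make these categories concrete, here is a minimal, illustrative sketch (not the original analysis code) of how three grades for a single image can be bucketed; the spread between the highest and lowest grade is enough to distinguish the three cases.

```python
# Illustrative only: bucket three doctor grades for one image into the
# categories shown in Figure 1. Grades are integers from 1 (None) to 5 (Proliferative).
def disagreement_bucket(grades):
    spread = max(grades) - min(grades)
    if spread == 0:
        return "full agreement"                       # e.g. (2, 2, 2)
    if spread == 1:
        return "single disagreement of magnitude 1"   # e.g. (2, 2, 3) or (2, 1, 2)
    return "larger disagreement"                      # e.g. (2, 4, 4) or (1, 2, 3)

print(disagreement_bucket([2, 2, 3]))  # single disagreement of magnitude 1
print(disagreement_bucket([2, 4, 2]))  # larger disagreement
```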

Particularly in this last setting, label disagreement poses a serious challenge. Each instance in the dataset (in our case an image) is labelled by multiple doctors, but these diagnoses often disagree, sometimes significantly (Figure 1). This raises a question: can we use machine learning to predict which images are likely to cause the most disagreement? Doing so could help us identify cases that need more labels, or adapt our training methods accordingly.

A first approach is to train a classifier on the data in the standard way (perhaps using the empirical histogram of labels as soft targets) and use the classifier's output probabilities (e.g. their entropy or variance) as a measure of uncertainty. However, the features that drive this uncertainty, such as distractor artefacts or obscured image attributes, may be completely unrelated to the classification problem.
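
As a concrete illustration of this two-step baseline, here is a hedged numpy sketch. It assumes the classifier has already been trained and produces an (n_images, n_grades) array of predicted probabilities; the function names are ours, not from the paper.

```python
# A hedged sketch of the two-step baseline: turn a trained classifier's output
# probabilities into uncertainty scores via entropy or grade variance.
import numpy as np

def entropy_uncertainty(probs, eps=1e-12):
    """Predictive entropy: large when probability mass is spread across grades."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def variance_uncertainty(probs):
    """Variance of the grade under the predicted distribution (grades 1..K)."""
    grades = np.arange(1, probs.shape[1] + 1)
    mean = probs @ grades
    return probs @ grades**2 - mean**2

probs = np.array([[0.05, 0.05, 0.60, 0.20, 0.10],    # spread-out prediction
                  [0.96, 0.02, 0.01, 0.005, 0.005]]) # confident prediction
print(entropy_uncertainty(probs))   # the first image scores as far more uncertain
print(variance_uncertainty(probs))
```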

Therefore, we compare this to direct uncertainty prediction: training a model to predict the disagreement directly from the image. This can be done by using binary targets of agree/disagree for each image. We can picture the difference between these methods with the following diagram:

We hypothesize that under a couple of natural conditions:

  1. The data has features that are indicative of uncertainty (but not necessarily of the label)
  2. The number of noisy labels per datapoint is small (2-3)

direct uncertainty prediction will outperform the two-step uncertainty-via-classification approach. We first test this on a synthetic task where we can precisely control the data parameters. We draw data from a mixture of several Gaussians and generate 2-3 noisy labels for each datapoint based on the probability of it coming from the different mixture centers. The mixture centers themselves form two clusters, one of high variance and one of low variance. This setting satisfies both of the conditions above, and we see that direct uncertainty prediction does indeed perform better here.
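
Below is a hedged sketch of this kind of synthetic setup; the exact number of components, dimensions, and parameters in the paper may differ. Centers in one cluster sit close together relative to the within-component noise (so sampled labels often disagree there), while centers in the other cluster are well separated, and the noisy labels per point yield a binary agree/disagree target for direct uncertainty prediction.

```python
# Illustrative synthetic setup: a Gaussian mixture whose centers form a tight
# cluster and a well-separated cluster, with 3 noisy labels per point sampled
# from the posterior over centers, turned into a binary disagreement target.
import numpy as np

rng = np.random.default_rng(0)
n_centers, dim, n_points, n_labels = 6, 2, 1000, 3

# Half the centers are packed closely together, half are spread out.
centers = np.concatenate([rng.normal(-4, 0.5, size=(n_centers // 2, dim)),
                          rng.normal(+4, 3.0, size=(n_centers // 2, dim))])
sigma = 1.0  # within-component noise

# Sample data points from the mixture (equal mixture weights).
component = rng.integers(n_centers, size=n_points)
x = centers[component] + rng.normal(0, sigma, size=(n_points, dim))

# Posterior over centers for each point.
sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
log_post = -sq_dist / (2 * sigma**2)
post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)

# Each "labeller" samples a center index from the posterior: 3 noisy labels per point.
labels = np.stack([(post.cumsum(1) > rng.random((n_points, 1))).argmax(1)
                   for _ in range(n_labels)], axis=1)

# Binary agree/disagree target for direct uncertainty prediction.
disagree = (labels.max(axis=1) != labels.min(axis=1)).astype(int)
print("fraction of points with label disagreement:", disagree.mean())
```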

Next, we turn to our main application: healthcare data, specifically retinal images. These images can be used to diagnose a patient with Diabetic Retinopathy (DR). DR is graded on a five-class scale: None, Mild, Moderate, Severe, Proliferative, corresponding to grades 1 to 5. Of particular clinical interest is the threshold at grade 3, which divides examples into non-referable and referable DR and can lead to very different clinical pathways for the patient. Our data consists of a large dataset T, in which each image has only a few noisy labels. We also have a very small gold-standard adjudicated dataset A, which, in addition to several individual doctor labels per image (this time from specialists), also contains a single label obtained via a discussion between several doctors: an adjudicated label.
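
For reference, here is a tiny illustrative snippet of the grade scale and the referable threshold described above (treating grade 3, Moderate, and above as referable):

```python
# Illustrative mapping of the five DR grades to the referable/non-referable split.
GRADE_NAMES = {1: "None", 2: "Mild", 3: "Moderate", 4: "Severe", 5: "Proliferative"}

def is_referable(grade: int) -> bool:
    return grade >= 3

print([(g, GRADE_NAMES[g], is_referable(g)) for g in range(1, 6)])
```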

We train our uncertainty prediction methods on a train/test split of T and evaluate on A. In particular, we look at three applications of uncertainty prediction:

  1. We test whether our models can correctly identify images in A where an aggregate statistic of the individual doctor labels differs from the adjudicated label.
  2. We apply a ranking test to see whether the models can correctly rank images for which the adjudicated label is “furthest away” from the entire distribution of individual doctor grades. In this setting, we can also directly compare the model ranking to ranking by a subset of doctors; we find that the model-induced ranking lies between 5 and 6 doctor grades: the model ranks better than using 5 doctor grades, but worse than using 6. (A toy sketch of this ranking comparison is shown below the list.)
  3. We test how budgeting doctor labelling effort according to the model rankings improves agreement with the adjudicated label.
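
To make the ranking test more concrete, here is a toy, hypothetical sketch (array shapes, numbers, and function names are ours, not the paper's). Images are ranked by the model's predicted uncertainty, and we measure how much adjudicated-label distance falls in the top of that ranking; the same metric could be computed for a ranking built from a subset of doctor grades instead.

```python
# Toy sketch of ranking images by predicted uncertainty and checking how much
# "distance to the adjudicated label" the top of the ranking captures.
import numpy as np

def distance_to_adjudicated(doctor_grades, adjudicated):
    """Mean absolute distance of individual grades from the adjudicated grade."""
    grades = np.asarray(doctor_grades)
    return np.abs(grades - np.asarray(adjudicated)[:, None]).mean(axis=1)

def ranking_quality(scores, distances, top_k):
    """Average adjudicated distance among the top_k images under a ranking by scores."""
    order = np.argsort(-scores)            # highest score (most uncertain) first
    return distances[order[:top_k]].mean()

# Toy numbers: 4 images, 3 individual doctor grades each, plus an adjudicated grade.
doctor_grades = np.array([[2, 2, 2], [1, 3, 2], [3, 5, 4], [1, 1, 1]])
adjudicated   = np.array([2, 1, 5, 1])
model_scores  = np.array([0.10, 0.70, 0.90, 0.05])  # direct uncertainty predictions

dist = distance_to_adjudicated(doctor_grades, adjudicated)
print(ranking_quality(model_scores, dist, top_k=2))  # distance captured in the top 2
```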

We evaluate each of these tests in multiple ways and, in general, find that direct uncertainty prediction convincingly outperforms the two-step uncertainty via classification.

Check out our paper, Direct Uncertainty Prediction with Applications to Healthcare, for more details!

Acknowledgements: Thanks to Jon Kleinberg, Ziad Obermeyer and Sendhil Mullainathan for their feedback on this post.