AI Outside In: Machine Learning's Triangle of Error
To make information on Artificial Intelligence more useful and accessible to everyone, from students to non-technical people curious about how AI works, we’ve teamed up with Google’s People + AI Research (PAIR) initiative, whose mission is to make partnerships between people and AI more productive, engaging, and fair.
Here is the third post in a three-part series — AI Outside In — by PAIR Writer in Residence, independent tech writer and blogger David Weinberger. He offers his outsider perspective on key developments in AI research and will explain central concepts in the field of machine learning. He’ll be looking at the technology within a broader context of social issues and ideas. His opinions are his own and do not necessarily reflect those of Google.
Machine learning's superpower
When we humans argue over what’s fair, sometimes it’s about principles and sometimes about trade-offs. But machine learning systems “think” about fairness in terms of three interrelated factors: two ways the machine learning (ML) can go wrong, and the most basic way of adjusting the balance between these potential errors. Deciding which type of error you prefer depends entirely on the sort of fairness — defined mathematically — you’re aiming at. But one way or another, you have to decide.
At their heart, many ML systems are classifiers. They ask: Should this photo go into the bucket of beach photos or not? Should this dark spot on a medical scan be classified as a fibrous growth or something else? Should this book go on the “Recommended for You” list or not? ML’s superpower is that it lets computers make these sorts of “decisions” based on what they’ve inferred from looking at thousands or even millions of examples that have already been reliably classified. From these examples they notice patterns that indicate which categories new inputs should be put into.
While this works better than almost anyone would expect – and a tremendous amount of research is devoted to fundamental improvements in classification algorithms – virtually every ML system that classifies inputs mis-classifies some of them. An image classifier might think that the photo of a desert is a photo of a beach. The cellphone you’re dictating into might insist that you said “Wreck a nice beach” instead of “Recognize speech.”
So, researchers and developers typically test and tune their ML systems by having them classify data that’s already been reliably tagged, the same sort of data these systems were trained on. In fact, it’s quite typical to hold back some of the labeled data during training so that the system can later be tested on data it has not yet seen. Since the right classifications are known for the test data, the developers can quickly see how well the system has done.
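As a rough illustration of holding data back, here is a minimal Python sketch. The photo names, the 20 percent test fraction, and the helper name `hold_out_split` are all assumptions made up for this example, not anything a particular ML toolkit prescribes.

```python
import random

def hold_out_split(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and hold back a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test examples are held back; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled data: (photo name, is it a beach?)
labeled_photos = [(f"photo_{i}.jpg", i % 3 == 0) for i in range(10)]

train_set, test_set = hold_out_split(labeled_photos)
print(len(train_set), "photos for training,", len(test_set), "held back for testing")
```

Because the held-back photos already carry reliable labels, the system's answers on them can be checked directly.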
In this sort of basic testing, there are two ways the system can go wrong. It can put some of the images into the “Beach” bucket when in fact they are not photos of a beach. Or it can miss some images of beaches and mistakenly put them into the “No Beach” bucket.
In this post, let’s call the first “False alarms”: the ML thinks the image is a beach but it isn’t.
We can call the second “Missed targets”: the ML failed to recognize an actual beach photo.
ML practitioners have terms for these two types of errors. False alarms are false positives. Missed targets are false negatives. But just about everyone finds these names confusing, even many professionals. One problem is that it’s too easy to confuse the positivity of the classification with the positivity of the trait being classified. For example, photos of cups might be examined to see if they are not full. An image of a full cup put into the “not full” bin would count as a false positive even though we might think of “not full” as a negative. And logically, shouldn’t a false negative be a positive?
So, let’s go with false alarms and missed targets as we talk about errors.
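To make the two kinds of error concrete, here is a small Python sketch that counts them for a beach classifier. The six test photos and their labels are invented for illustration, and `count_errors` is just a name chosen for this example.

```python
def count_errors(actual, predicted):
    """Count false alarms and missed targets for a yes/no classifier."""
    # False alarm: the ML said "beach" but the photo is not a beach.
    false_alarms = sum(1 for a, p in zip(actual, predicted) if p and not a)
    # Missed target: the photo is a beach but the ML said "no beach".
    missed_targets = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return false_alarms, missed_targets

# Hypothetical test results for six photos; True means "beach".
actual_is_beach    = [True, True,  False, False, True, False]
predicted_is_beach = [True, False, True,  False, True, False]

false_alarms, missed_targets = count_errors(actual_is_beach, predicted_is_beach)
print("False alarms:", false_alarms)
print("Missed targets:", missed_targets)
```

In this made-up test, the third photo is a false alarm and the second is a missed target.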
Take an example that doesn’t involve machine learning, at least not yet. Let’s say you’re adjusting a body scanner at an airport security checkpoint. Those who fly often can attest to the fact that most of the people for whom the scanner buzzes are in fact not security threats. They get manually screened by an agent — often a pat-down — and are sent on their way. That’s not an accident or a misadjustment. The scanners are set to generate false alarms rather frequently: if there’s any doubt, the machine beeps so a human will double check.
That’s a bit of a bother for the mis-classified passengers, but the alternative is far worse. If the machine were set to create fewer false alarms, then potentially it would miss more genuine threats. So it errs on the side of false alarms, rather than missed targets.
There are two things to note here. First, reducing the false alarms can increase the number of missed targets, and vice versa. Second, which is the better thing to do depends on the goal of the machine learning system. And that always depends on the context.
For example, false alarms are not too much of a bother when the result is that more passengers get patted down. But if the ML is being used to recommend preventive surgery, false alarms could lead people to undergo procedures that put them at unnecessary risk. Having a kidney removed for no good reason is far worse than getting an unnecessary pat-down.
The consequences can reach deep. If your ML system is predicting which areas of town ought to be patrolled most closely, then tolerating a high rate of false alarms may mean that local people will feel targeted for stop-and-frisk operations, potentially alienating them from the police force, which can have its own harmful consequences for a community.
It gets no less complex when considering how many missed targets to accept. If you tune your airport scanner so that it generates fewer false alarms, some people who are genuine threats might fall out of that classification and will be waved on through, endangering an entire airplane. On the other hand, if your ML is deciding who is worthy of being granted a loan, a false alarm – someone who is granted a loan and then defaults on it – may be more costly to the lender than turning down someone who would have repaid the loan.
And to not miss an opportunity to be confusing when talking about ML, consider an online book store that presents each user with suggestions for the next book to buy. If the list of suggestions is full of false alarms, readers won’t buy any of the books, hurting the site’s revenues. Avoiding missed targets (books a reader would have liked that never get suggested) is far less important to the site than keeping the suggestion queue full of books readers might actually want to buy.
But suppose you’re not getting the results you want from your ML system. Perhaps you’re not getting the revenues you want. Or worse, suppose you find that the system is being systematically unfair. For example, maybe you find that your loan-approval system is creating many more missed targets among applications from women or racial minorities.
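One way to look for that kind of unfairness is to compare the rate of missed targets across groups. The Python sketch below does this for a loan system, treating a missed target as an applicant who would have repaid but was turned down. The groups "A" and "B", the records, and the helper name are all hypothetical, and the sketch assumes test data where the true outcome is already known.

```python
from collections import defaultdict

def missed_target_rate_by_group(records):
    """For each group, the share of creditworthy applicants who were turned down."""
    would_repay = defaultdict(int)
    turned_down = defaultdict(int)
    for group, repays, approved in records:
        if repays:
            would_repay[group] += 1
            if not approved:
                turned_down[group] += 1
    return {group: turned_down[group] / would_repay[group] for group in would_repay}

# Hypothetical records: (group, would have repaid?, approved by the system?)
records = [
    ("A", True, True), ("A", True, True), ("A", True, False), ("A", False, False),
    ("B", True, False), ("B", True, False), ("B", True, True), ("B", False, True),
]

rates = missed_target_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"Group {group}: {rate:.0%} of creditworthy applicants turned down")
```

In this invented data, group B's creditworthy applicants are turned down twice as often as group A's, which is the sort of gap that would call for a closer look.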
Think hard and out loud
Adjusting the mix of false alarms and missed targets brings us to the third point of the Triangle of Error: the ML confidence level.
One of the easiest ways to adjust the percentage of false alarms and missed targets is to change the threshold of confidence required to make it into the bin. (Other ways are to train the system on better data, to adjust its classification algorithms, and otherwise to drive down the number of both missed targets and false alarms, which is always a good thing to do.) For example, suppose you’ve trained an ML system on hundreds of thousands of images that have been manually labeled as smiling or not. From this training, the ML has learned that a broad expanse of light patches towards the bottom of the image is highly correlated with smiles, but then there are the Clint Eastwoods whose smile is much subtler. When the ML comes across a photo like that, it may classify it as smiling, but not as confidently as the image of the person with the broad, toothy grin.
If you want to lower the percentage of false alarms, you can raise the confidence level required to be put into the “Smiling” bin. Let’s say that on a scale of 0 to 10, the ML gives a particular toothy grin a 9, while Clint gets a 5. If you stipulate that it takes at least a 6 to make it into the “Smiling” bin, Clint won’t make the grade; he’ll become a missed target. But some other images rated 5 may well be faces that indeed are not smiling; with the threshold at 6, the ML will correctly classify them as not smiling, which means fewer false alarms.
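The threshold rule in this example can be sketched in a few lines of Python. The scores of 9 and 5 and the threshold of 6 come from the text; the third image, a non-smiling face that also scored 5, and the function name `classify` are assumptions added for illustration.

```python
def classify(confidence, threshold):
    """Put an image in the "Smiling" bin only if its confidence meets the threshold."""
    return "Smiling" if confidence >= threshold else "Not smiling"

# Confidence scores on the 0-to-10 scale.
# "neutral face" is a hypothetical non-smiling image that also scored 5.
scores = {"toothy grin": 9, "Clint": 5, "neutral face": 5}

threshold = 6
bins = {name: classify(score, threshold) for name, score in scores.items()}
for name, label in bins.items():
    print(f"{name}: {label}")
```

With the threshold at 6, Clint becomes a missed target, while the neutral face that scored the same 5 is correctly kept out of the “Smiling” bin.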
Adjusting the confidence level can help get the balance of false alarms and missed targets right. But what counts as right is not something the machine learning system can decide on its own. That takes a human deciding what she wants from the system and what the trade-offs should be.
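To see the trade-off in numbers, the Python sketch below counts both kinds of error at several confidence thresholds. The ten scored test images are invented for illustration, and which row counts as the right balance is exactly the decision the code cannot make.

```python
def errors_at_threshold(examples, threshold):
    """Return (false alarms, missed targets) for a given confidence threshold."""
    false_alarms = sum(1 for score, smiling in examples
                       if score >= threshold and not smiling)
    missed_targets = sum(1 for score, smiling in examples
                         if score < threshold and smiling)
    return false_alarms, missed_targets

# Hypothetical test set: (confidence score 0-10, is the person actually smiling?)
examples = [(9, True), (8, True), (7, False), (6, True), (5, True),
            (5, False), (4, False), (3, True), (2, False), (1, False)]

results = {}
for threshold in (3, 5, 7):
    results[threshold] = errors_at_threshold(examples, threshold)
    false_alarms, missed_targets = results[threshold]
    print(f"threshold {threshold}: {false_alarms} false alarms, "
          f"{missed_targets} missed targets")
```

On this made-up data, raising the threshold from 3 to 7 cuts the false alarms from 3 to 1 but raises the missed targets from 0 to 3.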
Deciding on the trade-offs occasions difficult conversations, now that machine learning has given us that power and that obligation. But perhaps one of the most useful consequences of machine learning at the social level is not only that it requires us humans to think hard and out loud about these issues, but also that the requisite conversations at least implicitly acknowledge that we can never entirely escape error. At best we can decide how to err in ways that meet our goals and that treat all as fairly as possible.
Original illustration by Anna Young.
This work is licensed under a Creative Commons Attribution 4.0 International License.