In my last post, I covered a pitfall of data analysis where two independent factors may appear to be correlated with each other. Today I’ll cover another common pitfall when interpreting what your data are telling you. This one is particularly troublesome because most people’s intuition is so clear, yet so wrong.
To illustrate, I’ll use our student risk model as an example.
We have a model that classifies students with 85% accuracy into “will pass this class” (classified not risky) and “will fail this class” (classified risky). Boiling it down to two outcomes like red light / green light understates the complexity, but once the math is done, it really is that easy to use.
Pretty awesome, right?
Yeah, we get that a lot.
Now, let’s say you’re teaching a class where you know 90% of students pass and 10% don’t. You check your dashboard and the light turns red for one of the students: they’ve just moved over to the “risky” bucket, indicating they’re at risk of failing your class.
Pop quiz: what is the probability that the student will fail the class?
[ ♪ ♫ theme from Jeopardy! plays ♫ ♪ ]
Did you say 85% because that’s the accuracy I gave you? Did you have a vague sense that it’s more than 10%, but 85% isn’t right either? Or did you say 39% because you already saw where I was going and computed the precision or positive predictive value?
Answer: When a student is classified as risky, they’ll fail the class 39% of the time.
That sounds counterintuitive, but let’s look at the math. If you have 1,000 students in the population, we expect about 100 of them to fail. We also know that our classifier will miss some of them: it has a recall of 85%, so it will catch 85 of those 100 and miss the other 15. It also has a specificity of 85%, so it will misclassify 15% of the 900 students who aren’t actually risky (135 of those). Hey, nobody’s perfect.
The matrix looks like this:
|                    | Classified Risky | Classified Not Risky | Total |
|--------------------|------------------|----------------------|-------|
| Actually Risky     | 85               | 15                   | 100   |
| Actually Not Risky | 135              | 765                  | 900   |
| Total              | 220              | 780                  | 1,000 |
What we really want to know is “Among the 220 students who are classified as risky, what’s the probability of a failing grade?” That’s 85 out of 220 or 39%.
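If you’d rather check the arithmetic than take my word for it, here’s a minimal sketch in Python that rebuilds the matrix from the rates above. The rates and counts come straight from this example; the function and variable names are just illustrative.

```python
# Rebuild the expected confusion-matrix counts from summary rates.
def confusion_matrix(population, fail_rate, recall, specificity):
    actually_risky = population * fail_rate           # 100 students expected to fail
    actually_not_risky = population - actually_risky  # 900 students expected to pass

    true_positives = actually_risky * recall               # 85 flagged correctly
    false_negatives = actually_risky - true_positives      # 15 at-risk students missed
    true_negatives = actually_not_risky * specificity      # 765 cleared correctly
    false_positives = actually_not_risky - true_negatives  # 135 flagged by mistake

    return true_positives, false_positives, false_negatives, true_negatives

tp, fp, fn, tn = confusion_matrix(population=1_000, fail_rate=0.10,
                                  recall=0.85, specificity=0.85)

# Positive predictive value: among students classified risky, how many fail?
ppv = tp / (tp + fp)
print(f"classified risky: {tp + fp:.0f}, PPV: {ppv:.0%}")  # 220 students, 39%
```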
What the what‽ Why turn a light red when a student has a 39% chance of failing the class? Is that a good predictive value?
The answer lies in what comes next. Presumably you’ll do something when you learn a student is risky. Maybe something time-consuming and expensive. In fact, I’ll bet it’s so expensive that it can’t reasonably be applied to 100% of students because if that were an option, you’d already be doing it as a matter of course.
Instead of saying “a positive test means the student will fail 39% of the time,” I could have said “an intervention for the riskiest 22% of students catches 85% of the at-risk ones.” Or I could have said “when the model says a student will pass, it’s right 98% of the time.” All of those statements are true, and helpful, but none of them tells you on its own whether 39% is a good predictive value.
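Each of those framings falls out of the same confusion-matrix counts from the table above, if you want to verify them:

```python
# Same counts as the sketch above: tp=85, fp=135, fn=15, tn=765.
tp, fp, fn, tn = 85, 135, 15, 765

coverage = (tp + fp) / 1_000  # 220 / 1,000 = 22% of students get flagged
recall = tp / (tp + fn)       #  85 /   100 = 85% of at-risk students caught
npv = tn / (tn + fn)          # 765 /   780 = 98% of "will pass" calls correct
print(f"coverage {coverage:.0%}, recall {recall:.0%}, NPV {npv:.0%}")
```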
How will you respond when a light turns red? If it’s something you can afford to do for 22% of your students, and you can live with the roughly 2% of students the model clears who will fail anyway, this is a good predictive value for you. Just don’t confuse accuracy (how often the model is right) with predictive value (the real risk among people classified as risky).
In a future post, I’ll work through how we can trade positive predictive value for recall so you can focus (or broaden) your results.