Earlier this week, LA-based public defender and union organizer Ace Katano (ed. note: yours truly) tweeted this story:
The replies were, unsurprisingly, full of upset people. It’s no stretch to imagine the young man’s “affiliation” coming up in a future interaction with police, making it more likely he’ll get in worse trouble next time and helping to perpetuate the mass incarceration machine. But all I could think of was how badly a predictive policing algorithm could misunderstand this data.
Others have offered excellent in-depth treatments of predictive policing and the data that goes into it. This isn’t one of those. What follows, instead, is a discussion of a specific methodological issue that arises in data analysis environments like those of predictive policing, and some thoughts about what data scientists can do to anticipate and work against it.
So, the tweet.
If the unnamed “juvie client” from Katano’s tweet got booked by the LAPD, the data point about his purported gang affiliation may have been captured in the city’s predictive policing program, which monitors certain individuals who have interacted with police based on an algorithmically generated “threat score.” We can see how an algorithm might read a 1 in the juvie client’s gang affiliation column and calculate a higher score. It might even be the difference between monitoring and not monitoring him.
There are plenty of bad things going on in the above scenario — for example, the loaded usage of the term “threat” and the idea that it can be meaningfully or univariately quantified at all — but the one that concerns us today is this: that data point doesn’t measure gang affiliation. The officer may have meant it to, and the algorithm that calculates the threat score may treat it as though it does, but it doesn’t, for the simple reason that gang affiliation isn’t a binary variable.
It should be intuitive that this is the case. That 1 in our imaginary gang affiliation column could mean anything from actively committing violent crimes for a gang to having a cousin in a gang, a wide spectrum any meaningful threat assessment — algorithmic or not — ought to take into account. In fact, a useful understanding of an individual’s relationship to a gang probably can’t even come from a single variable; it probably takes multiple different spectra, including attitudes about the gang, behaviors associated with commitment to the gang, how long the individual has been involved and at what rate their involvement has increased or decreased, and many other factors. It probably also takes qualitative information, like interview transcripts, that can’t be easily ingested by an algorithm. Even then, our very framing — “gang affiliation” — inherits certain baggage from its origins in law enforcement operations that merits questioning. A 1 or a 0, though, tells us next to nothing useful about an individual’s relationship to gangs, because a binary value can’t capture much about something so complex.
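To make the contrast concrete, here’s a hypothetical sketch of what a single boolean column throws away. Every field name below is invented for illustration; none is drawn from a real police or predictive policing system.

```python
from dataclasses import dataclass
from typing import Optional

# What the booking record actually stores: one bit.
gang_affiliated: bool = True

# A hypothetical richer representation. The point is that "affiliation"
# decomposes into several distinct spectra plus qualitative context.
@dataclass
class GangInvolvementProfile:
    attitude_score: float           # e.g., 0.0 (rejects gang) to 1.0 (identifies with it)
    commitment_behaviors: int       # count of documented gang-linked behaviors
    months_involved: Optional[int]  # None if unknown
    involvement_trend: float        # negative = disengaging, positive = escalating
    interview_notes: str = ""       # qualitative context an algorithm can't easily score

# "Has a cousin in a gang" and "actively commits violent crimes for a
# gang" both collapse to gang_affiliated == True, but look nothing
# alike in the richer type:
cousin_case = GangInvolvementProfile(0.1, 0, None, -0.2,
                                     "family tie only, per interview")
active_case = GangInvolvementProfile(0.9, 14, 36, 0.5,
                                     "documented ongoing involvement")
```

Even this sketch leaves out plenty, which is part of the point: the richer the type, the more visible its remaining assumptions become.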
It’s not just that this binary view is an inaccurate way of measuring a thing we’re calling “gang affiliation.” The issue isn’t that a 1 or a 0 doesn’t measure this thing in this particular case; it’s that it can’t. To put it another way, it’s not so much that the data point is “wrong,” but that the data type doesn’t match the underlying phenomenon.
I’m using the term “type” in the programming sense, i.e., the range of values a particular piece of data can take. But it’s helpful to think of a data type as capturing certain assumptions about the phenomenon being measured. Defining, for example, the range of values a gang affiliation metric can take encodes certain ideas about the idea of gang affiliation, which, again, is inseparable from its usage by law enforcement. What kind of thing even is gang affiliation? Is it binary? Is it a spectrum? Is it more like a set of categories? How many decimal points of precision does it need? Is it a number at all?
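In code, each answer to those questions is a different type declaration, and each declaration bakes in a different assumption. A hypothetical sketch of three candidate types, none drawn from a real system:

```python
from enum import Enum

# Assumption: affiliation is binary. You're affiliated or you're not.
affiliated: bool = True

# Assumption: affiliation is a single continuous spectrum, and one
# axis can capture it.
def validate_score(x: float) -> float:
    if not 0.0 <= x <= 1.0:
        raise ValueError("affiliation score must be in [0, 1]")
    return x

# Assumption: affiliation is categorical, with these (invented) categories.
class Affiliation(Enum):
    NONE = "none"
    FAMILY_TIE = "family_tie"
    ASSOCIATE = "associate"
    MEMBER = "member"

# The same person looks different under each type:
print(affiliated)                  # True
print(validate_score(0.15))        # 0.15
print(Affiliation.FAMILY_TIE)      # Affiliation.FAMILY_TIE
```

Choosing among these isn’t a technical decision so much as a claim about what kind of thing gang affiliation is, made before any data is collected.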
Programming languages protect against explicit type errors, such as trying to divide an integer by a string of characters, but they’re not much help if you inherit poorly typed data that’s not obviously invalid. If the data collectors and data analysts are the same people, a type mistake might be an easy fix: just go out and collect more data with the right type this time, assuming there even is a “right type.” But things get messy when you’re using someone else’s data. Regardless of how well they measure the thing being studied, the encoded ideas that types represent get passed down to the people and algorithms that end up using the data. By then, it’s often hard to get the data collector to go back and change them, and sometimes it’s not even clear to the downstream analyst or algorithm wrangler that they need changing.
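A quick Python illustration of the distinction. The scoring formula is invented purely for illustration; the point is only that the second snippet runs without complaint.

```python
# The language catches errors that are explicit in the code:
try:
    3 / "gang"  # TypeError: can't divide an int by a str
except TypeError as e:
    print(f"caught: {e}")

# But it can't catch a poorly chosen type. This runs cleanly, even
# though a bool is the wrong shape for the phenomenon. The +25 bump
# is a made-up formula, not any real threat-scoring logic:
record = {"name": "juvie client", "gang_affiliated": True}
threat_score = 50 + (25 if record["gang_affiliated"] else 0)
print(threat_score)  # 75 -- type-correct, conceptually wrong
```

The type checker’s silence in the second case is exactly the problem: nothing in the toolchain flags that a boolean was a bad model of the underlying phenomenon.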
Data is an imperfect interface, not just to the phenomenon being measured, but also to the data collectors’ assumptions about that phenomenon and about how their data will eventually be used. Even in the most charitable reading of Katano’s story, which assumes the officer knows that gang affiliation is more complicated than a 1 or a 0, there’s still a gap in understanding how an algorithm might misuse that information. Just because a particular cop knows that 1 doesn’t really capture the whole story doesn’t mean they know an algorithm is going to read the data and assume it does.
Because algorithms for predicting human behavior, like the data scientists who create them, often get their data from external sources (e.g., police departments), there’s not always an opportunity to argue assumptions. The analyst or algorithm may have to trust that these external sources are thinking about the problem in the right way and that the data measures what it purports to measure. But when there’s a misunderstanding of types and a large organizational distance between analysts and their data sources, the analysts can’t change the sources’ means of data collection, and the collectors can’t foresee the analysts’ means of prediction.
One measure data scientists can take against this problem is subjecting the conditions under which their data was originally collected to more scrutiny. Ask: what kind of interaction between the data collector and subjects would it take to produce this data? Is it reasonable to assume this kind of interaction took place? Was this data prepared specifically for analysis or is it a dataset derived from record-keeping? What was the purpose of collecting this data and does my analysis fall outside that purpose? Can this phenomenon be meaningfully measured at all? Answers to these questions may be hard to come by, but keeping the data collection methods in mind can help steer analysis away from bad assumptions.
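One lightweight way to operationalize that scrutiny is to refuse to analyze a dataset until its provenance questions have been answered and recorded alongside it. Here’s a hypothetical sketch; the fields mirror the questions above and are invented for illustration, not part of any real tooling.

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    collection_interaction: str   # what interaction produced this data?
    collected_for_analysis: bool  # or derived from record-keeping?
    original_purpose: str
    intended_use: str

    def use_matches_purpose(self) -> bool:
        # Crude check: flag analyses that fall outside the stated purpose.
        return self.intended_use == self.original_purpose

# Filled in for the booking-record scenario from the story:
booking_data = Provenance(
    collection_interaction="officer field interview at booking",
    collected_for_analysis=False,
    original_purpose="record-keeping",
    intended_use="predictive threat scoring",
)
if not booking_data.use_matches_purpose():
    print("warning: analysis falls outside the data's original purpose")
```

A record like this can’t answer the hardest question on the list, whether the phenomenon can be meaningfully measured at all, but it forces the analyst to confront the gap between collection and use before the analysis starts.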
In the case of the cop and the juvie client, I’d prefer if police departments didn’t collect that kind of data in the first place. But the difficulties in working with externally collected data more broadly aren’t going away. And while there’s no straightforward solution, there are approaches for identifying these issues and mitigating their effects. Even if we can’t usually expect to tell external parties how to collect better data, we can be more prepared to work meaningfully with what we inherit from them. As data collection comes increasingly into contact with operations in domains like law enforcement, it’s possible that better training for collectors will lead to better data. But until it does, it falls to the data scientist to ensure that their analyses and those of the algorithms they design don’t pass on the flaws in the data they inherit.