Missing Datasets

These days, I’m a fellow over at the Data & Society Research Institute. I’m working on a couple of different projects, but the primary one has to do with missing datasets.

Calling something “missing” automatically implies that it should exist, and that’s sort of the point of my project. We’re living in a time of unprecedented levels of data collection. This isn’t a revolutionary insight; it’s just a rote fact. We are systematically tracked, recorded, and documented in ways that are more thorough and expansive than ever before. Though people have different relationships and attitudes to the tactics, methods, and vehicles of this data collection (attitudes that range from hopefulness around perceived benefits to desperate techno-pessimism about potential abuses), no one is exempt.

But at the exact same time that this massive overcollection is unfolding, there are blank spots in the data ecosystem. That is, within contexts that seem to have nearly every possible metric quantified and recorded, there exist spaces that are curiously devoid of data.

Here are some pretty familiar examples that explain what I mean: Despite the fact that the workplace is heavily studied by sociologists and companies have obvious incentives for collecting data on employees, before ProPublica’s 2013 initiative there was no data on unpaid internships. There was no set of data that anyone could point to that gave any idea of how many students were working unpaid internships, or how many companies were offering them. It was a missing dataset.

An even better-known (and much more political) example has to do with civilians and the police. It wasn’t until quite recently, thanks to initiatives like D. Brian Burghart’s Fatal Encounters website and The Guardian’s The Counted campaign, that we as a public started to have an idea of the number of civilians killed in interactions with law enforcement agencies. Prior to their work, that was a missing dataset.

In the article The Collection and the Cloud, Amelia Abreu points out that “...the Internet Archive isn’t the Internet Archive, but an Internet Archive, very much built and collected from a certain standpoint and position of power”. Abreu’s point -- that there’s always a reason why certain things get saved and others don’t -- applies to data as well. There’s a reason why certain data becomes a dataset, and that reason is as much personal and institutional as it is technological. There’s not much incentive for a company to collect data on why it isn’t paying employees, just like there isn’t much incentive for the police to talk about how many unarmed civilians are killed each year, or for tech companies to release abysmal diversity statistics. It’s not that organizations are maliciously trying to hide information so much as there’s just no reason for them to go out of their way to collect, let alone publish, that data.

But of course, there is reason for other people to have that data, and at a time when data is collected about nearly everything, it wouldn’t be surprising for many to feel as though not having data means that something doesn’t exist. For every dataset where there’s an impetus for someone not to collect, there’s a group of people who would benefit from its presence. More data doesn’t always mean better answers, but in cases where data is used as the end-all tool of proof or a definitive measure for change, it’s clear that lacking it can be a serious structural disadvantage.

And here’s where my project comes in. I’m interested in finding other missing datasets and in helping those who are directly affected by the issues in question fill them. Is there a way to both provide access to previously unattainable datasets and give those people who have a stake in information the ability to affect it?

That’s the high-level overview of some of the work I’ll be doing this year. I’m just at the beginning of the process, but if you’re interested in any of these questions or have relevant datasets of your own, please do reach out.