The Verizon records and the dark side of 'big data' [Updated]

One month ago, in her FP article Think Again: Big Data, Kate Crawford wrote:

While many big-data providers do their best to de-identify individuals from human-subject data sets, the risk of re-identification is very real. Cell-phone data, on mass, may seem fairly anonymous, but a recent study on a data set of 1.5 million cell-phone users in Europe showed that just four points of reference were enough to individually identify 95 percent of people. There is a uniqueness to the way that people make their way through cities, the researchers observed, and given how much can be inferred by the large number of public data sets, this makes privacy a "growing concern." We already know, thanks to academics like Alessandro Acquisti, how to predict an individual's Social Security number simply by cross-analyzing publicly available data.
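To make the re-identification point concrete, here is a toy sketch in Python. It is not the method used in the study Crawford cites: the traces, user labels, and function names are all hypothetical, and it searches for the smallest set of a user's own location points that matches nobody else, rather than sampling points at random as a real study would.

```python
from itertools import combinations

# Hypothetical toy traces: user label -> set of (cell tower, hour) sightings.
traces = {
    "a": {("T1", 8), ("T2", 9), ("T3", 18)},
    "b": {("T1", 8), ("T2", 9), ("T4", 18)},
    "c": {("T1", 8), ("T5", 9), ("T3", 18)},
}

def matching_users(points, traces):
    """Return every user whose trace contains all of the given points."""
    return [user for user, trace in traces.items() if set(points) <= trace]

def min_points_to_identify(user, traces):
    """Smallest number of the user's own points that matches only that user.

    Returns None if even the full trace is shared with someone else.
    """
    own_points = sorted(traces[user])
    for k in range(1, len(own_points) + 1):
        for combo in combinations(own_points, k):
            if matching_users(combo, traces) == [user]:
                return k
    return None

for user in traces:
    print(user, min_points_to_identify(user, traces))
```

In this made-up data set, no single sighting singles out user "a", but two sightings together do. With real traces covering millions of people, the same logic is what makes a handful of points so revealing.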

Today, some of the fears Crawford was talking about seem to have been realized with the news, broken by the Guardian's Glenn Greenwald, that the NSA has ordered Verizon to provide daily information on all telephone calls within its system for a three-month period ending on July 19. The order could cover more than 98.2 million customers, and there are obviously unanswered questions about which other companies received similar orders.

It's only fairly recently that the technology for analysis has advanced to the point that a dataset of this size would be useful. As Greenwald wrote, the data doesn't include personal information or the content of calls, but "its collection would allow the NSA to build easily a comprehensive picture of who any individual contacted, how and when, and possibly from where, retrospectively."
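A rough sketch of what "who any individual contacted, how and when, and possibly from where" can mean in practice, using nothing but call metadata. The record layout, field names, and phone numbers below are invented for illustration and are not taken from the Verizon order; the tower is assumed to be the caller's.

```python
from collections import defaultdict

# Hypothetical call-detail records: no names, no call content.
records = [
    {"caller": "555-0101", "callee": "555-0202", "start": "2013-06-01T09:15", "seconds": 120, "tower": "T17"},
    {"caller": "555-0303", "callee": "555-0101", "start": "2013-06-01T13:40", "seconds": 45, "tower": "T02"},
    {"caller": "555-0101", "callee": "555-0202", "start": "2013-06-02T22:05", "seconds": 900, "tower": "T31"},
    {"caller": "555-0404", "callee": "555-0505", "start": "2013-06-02T23:00", "seconds": 30, "tower": "T09"},
]

def contact_profile(number, records):
    """Group every call involving `number` by the other party.

    Each entry is (start time, duration in seconds, tower), where the tower
    is kept only when `number` placed the call, since the tower field is
    assumed to describe the caller's location.
    """
    profile = defaultdict(list)
    for r in records:
        if r["caller"] == number:
            profile[r["callee"]].append((r["start"], r["seconds"], r["tower"]))
        elif r["callee"] == number:
            profile[r["caller"]].append((r["start"], r["seconds"], None))
    return dict(profile)

for other, calls in contact_profile("555-0101", records).items():
    print(other, calls)
```

Even this trivial grouping shows who one number talks to, how often, at what hours, and roughly from where, all retrospectively and without touching the content of any call.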

As Shane Harris points out, the NSA's potential uses for this data could go beyond tracking individuals:

As I wrote in my book, The Watchers, the NSA has long been interested in trying to find unknown threats in very big data sets. You'll hear this called "data mining" or "pattern analysis." This is fundamentally a different kind of analysis than what I described above where the government takes a known suspect's phone number and looks for connections in the big metadatabase. 

In pattern analysis, the NSA doesn't know who the bad guy is. Analysts look at that huge body of information and try to establish patterns of activity that are associated with terrorist plotting. Or that they think are associated with terrorist plotting.

The NSA spent years developing very complicated software to do this, and met with decidedly mixed results. One such invention was a graphing program that plotted thousands upon thousands of pieces of information and looked for relationships among them. Critics called the system the BAG, which stood for "the big ass graph." For data geeks, this was cutting edge stuff. But for investigators, or for intelligence officials who were trying to target terrorists overseas, it wasn't very useful. It produced lots of potentially interesting connections, but no definitive answers as to who were the bad guys. As one former high-level CIA officer involved in the agency's drone program told me, "I don't need [a big graph]. I just need to know whose ass to put a Hellfire missile on."

But of course, the technology to do this kind of pattern analysis has improved dramatically since the Bush years. As Emanuel Pastreich put it in an op-ed written several days before the latest news, "The dropping cost of computational power means that individuals can gather gigantic amounts of information and integrate it into meaningful intelligence about thousands, or millions, of individuals with minimal investment."

I've highlighted some exciting uses of this kind of computational power on this blog, but we're also seeing the birth of a kind of government surveillance that lawmakers and privacy advocates have never had to contend with before. 

Update: Looks like there's more to come.

Barton Gellman and Laura Poitras of the Washington Post report that "the National Security Agency and the FBI are tapping directly into the central servers of nine leading U.S. Internet companies, extracting audio, video, photographs, e-mails, documents and connection logs that enable analysts to track a person’s movements and contacts over time." The companies involved are Microsoft, Yahoo, Google, Facebook, PalTalk, AOL, Skype, YouTube, and Apple.

Gellman and Poitras write that the program, called PRISM, and others like it, show how "fundamentally surveillance law and practice have shifted away from individual suspicion in favor of systematic, mass collection techniques."

