What can we learn from the last 200 million things that happened in the world?

If you've been following the political science geek Twitter/blogosphere for the last few days, you've probably come across the mysterious acronym GDELT. The excitement over Global Data on Events, Location, and Tone -- to give its full name -- is understandable. The singularly ambitious project could have a transformative effect on how we use data to understand and anticipate political events.

Essentially, GDELT is a massive list of important political events that have happened -- more than 200 million and counting -- identified by who did what to whom, when and where, drawn from news accounts and assembled entirely by software. Everything from a riot over food prices in Khartoum, to a suicide bombing in Sri Lanka, to a speech by the president of Paraguay goes into the system.

Similar event databases have been built for particular regions, and DARPA has been working along similar lines for the Pentagon with a project known as ICEWS. But for a publicly accessible program (you can download it here, though you'll need some programming skills to use it), GDELT is unprecedented in its geographic and historical scale. The database updates every night with events from the day's news. While it currently goes back to 1979, its developers are working on adding events going back as far as 1800, according to lead author Kalev Leetaru, a fellow at the University of Illinois Graduate School of Library and Information Science. (I've previously written about his work here.)

"It's the sheer size," says Leetaru, when asked what makes the project unique. "And the resolution. It's not just saying an event took place in Syria. It's saying who did what to whom. It will tell us that it was the military who attacked Christian civilians in this city on this day. If the article says it was worshippers who were attacked in their church, that will all be captured."

Events are sorted into four types: material conflict, material cooperation, verbal conflict, and verbal cooperation. Within those types, events are classified using a 300-category taxonomy called CAMEO, developed by Penn State's Philip A. Schrodt, which provides detail on the actors and the action that occurred.

For instance, an event like "Students and police fought in the Egyptian capital" will be coded as "EDU fought COP," and the record will also include the location and time of the event.
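To make the "who did what to whom, when and where" structure concrete, here is a minimal sketch of how one such coded event might be held in memory. The class and field names are illustrative assumptions, not GDELT's actual schema, and the date is a placeholder. Only the actor codes "EDU" and "COP," the verb, and the four event types come from the description above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class EventType(Enum):
    """The four broad event types described in the article."""
    MATERIAL_CONFLICT = "material conflict"
    MATERIAL_COOPERATION = "material cooperation"
    VERBAL_CONFLICT = "verbal conflict"
    VERBAL_COOPERATION = "verbal cooperation"


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical record layout; field names are assumptions."""
    source_actor: str      # who did it, e.g. "EDU" (students)
    action: str            # what they did
    target_actor: str      # to whom, e.g. "COP" (police)
    location: str          # where
    event_date: date       # when
    event_type: EventType  # one of the four broad types

    def summary(self) -> str:
        """Return the compact 'actor action actor' form."""
        return f"{self.source_actor} {self.action} {self.target_actor}"


# "Students and police fought in the Egyptian capital"
example = EventRecord(
    source_actor="EDU",
    action="fought",
    target_actor="COP",
    location="Cairo, Egypt",
    event_date=date(2013, 1, 1),  # placeholder date, not from the source
    event_type=EventType.MATERIAL_CONFLICT,
)

print(example.summary())
```

A fixed set of fields like this is what lets 200 million records be filtered and counted by actor, place, or date without rereading the underlying news stories.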

"We're already hard at work on a new version that will expand this dramatically, adding everything from disease to different classes of political transitions, and things like cyberwarfare," says Leetaru, noting that new types of events have increased in importance in the decade since the CAMEO system was developed. Geopolitically important financial events may also soon be included. 

So what can we do with all this? Well, for one thing, it could be an extremely powerful tool for researchers looking to track political events over time, and even to predict them. One early paper by Penn State PhD candidate James Yonamine uses GDELT data to track patterns of violence in different districts of Afghanistan:

(Maps by Penn State Geography PhD candidate Joshua Stevens)

GDELT could also be used to study political rhetoric, such as the kinds of statements that politicians make in the run-up to war in order to prime their citizens for conflict.

Leetaru sees even broader applications for researchers using a branch of mathematics known as complexity theory (closely related to chaos theory) to identify global patterns in seemingly random human events. "Most datasets that measure human society, when you plot them out, don't follow these nice beautiful curves," he says. "They're very noisy because they reflect reality. So mathematical techniques now let us peer through that to say, what are the underlying patterns we see in all this."

Of course, for all the high-tech software behind its creation and its potentially far-out applications, GDELT is, at its core, a way of summarizing news coverage, and old-fashioned legacy-media news coverage at that. The sources used to identify events include world news coverage from Agence France-Presse, the AP, BBC, Christian Science Monitor, New York Times, UPI, and the Washington Post, as well as a few more specialized outlets and Google News. Leetaru notes in his recent paper introducing the project that the increasing availability of news on the web has led to a "dramatic increase [of recorded events] since the beginning of the 21st century."

This leads to another potential problem: the number of events recorded in a given region may track how heavily the international media covers that region more closely than it tracks how often such events actually occur. A politically motivated shooting in Syria or the West Bank this month will probably be recorded. In a rural region of the Congo or Central African Republic? It's harder to say.
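A toy calculation makes the concern concrete. The sketch below assumes two hypothetical regions with the same number of actual events but very different chances of any given event being reported. Every number and name in it is invented for illustration and says nothing about GDELT's real coverage.

```python
def recorded_events(actual_events: int, coverage_rate: float) -> int:
    """Events that would reach the database if only a fraction
    of actual events are reported by the news sources."""
    return round(actual_events * coverage_rate)


# Hypothetical regions: identical actual event counts,
# very different (invented) media coverage rates.
regions = {
    "heavily covered region": {"actual": 100, "coverage_rate": 0.9},
    "sparsely covered region": {"actual": 100, "coverage_rate": 0.1},
}

recorded = {
    name: recorded_events(r["actual"], r["coverage_rate"])
    for name, r in regions.items()
}

for name, count in recorded.items():
    actual = regions[name]["actual"]
    print(f"{name}: {count} recorded of {actual} actual events")
```

In this invented case, the database would suggest one region is nine times as eventful as the other, when the only real difference is how much attention each receives.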

Leetaru says he's looking into supplementing some of the journalistic data with information from social-media-driven projects like Ushahidi. "As quality journalism is under attack from all sectors, whether that's government stepping up efforts to squelch it or the collapsing economics of it, we're starting to look at all the citizen journalism that's out there," he says. "One of the reasons we're focusing on mainstream journalism is that social media is a relatively new phenomenon."

Leetaru notes that he's been cautious about integrating social media data because of difficulties with quality control and verification. "Journalism's not perfect either, but at least there's that professional code of ethic," he says.

So whether or not data kills theory in the social sciences, someone still needs to get the information in the first place.

BULENT KILIC/AFP/Getty Images