## Access Event Data in Google BigQuery From R

I spent a good portion of this past spring working with GDELT. Unfortunately, a perfect storm of legal troubles and methodological criticisms turned GDELT from the rising star of computational social science into the ugly behemoth we’d all prefer not to discuss. I stopped updating my copy of GDELT and refocused my efforts on other projects.

So when Google announced that GDELT would be hosted for free on BigQuery, I was only mildly interested (read: didn’t care). GDELT exhibits some ideal characteristics for a demonstration of high-speed data processing: it’s small enough that you could play with it on a well-equipped consumer computer, but large enough (currently ~100GB) that doing so is a terrible idea. Some of the queries I worked with, for example, would occupy hours of computing time on a fairly powerful machine. When I tried some of the same queries on the BigQuery Console, they took ~10 seconds, an improvement of three orders of magnitude.

But then I happened to be looking over the dplyr documentation, noticed that it had BigQuery integration, and became slightly more interested. dplyr wraps the API from bigrquery (one of Hadley Wickham’s other projects). The effect is that we can integrate complex operations on exceedingly large data sources into R workflows with little overhead.

Getting bigrquery set up isn’t the most straightforward project, but it isn’t too hard either. From the bigrquery README:

If you just want to play around with the BigQuery API, it’s easiest to start with Google’s free sample data. To do that, you’ll also need to create your own project for billing purposes. If you’re just playing around, it’s unlikely that you’ll go over the 10,000 request/day free limit, but Google still needs a project that it can bill (you don’t even need to provide a credit card).

To create a project:

1. Open the Google Developers Console.
2. Click “Create Project” at the top.
3. Select a name and project ID, and click “Create”.
4. Turn on the BigQuery API by clicking “APIs & Auth” on the left, scrolling down to “BigQuery API”, and flipping the switch at the right from “OFF” to “ON”.
5. Click on “Overview” at the left.
6. Either the Project ID or Project Number at the top can be used to identify your project with bigrquery.

From there, you can access GDELT through the BigQuery Browser Console. Theoretically, you should be able to access it through the API at this point, but I couldn’t get that working. Instead, I just made a copy of the events table and imported it into my own project.  To do that:

1. Click on the dropdown arrow next to your project’s name in the left pane.  Select “Create a New Dataset”
2. For simplicity’s sake, just call it “GDELT”
3. Under “gdelt-bq:full” in the left pane, click the dropdown arrow next to “events” and click on “Copy table”
4. Copy the table to your project, to the GDELT dataset, and (again, for simplicity’s sake) call it “events”
5. You can access the data in R and play with it in dplyr.  Copy your project’s billing ID and paste it into the following gist:
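The gist itself isn’t reproduced here, but for reference, here’s a minimal sketch of querying the copied table from code, using Google’s standard google-cloud-bigquery Python client. The dataset and table names are the ones created above; `"your-billing-project"` is a placeholder for your own project ID.

```python
# Sketch of querying the copied GDELT events table. Requires the
# google-cloud-bigquery client library and authorized credentials;
# the query itself is only built (not executed) below.

def gdelt_count_sql(dataset: str = "GDELT", table: str = "events") -> str:
    """Build a query counting events per year in the copied table."""
    return (
        f"SELECT Year, COUNT(*) AS n_events "
        f"FROM {dataset}.{table} "
        f"GROUP BY Year ORDER BY Year"
    )

def run_query(project_id: str, sql: str):
    """Execute the query on BigQuery (needs credentials, so not run here)."""
    from google.cloud import bigquery  # deferred: only needed at query time
    client = bigquery.Client(project=project_id)
    return list(client.query(sql).result())

sql = gdelt_count_sql()
# rows = run_query("your-billing-project", sql)
```

The same query run through dplyr’s BigQuery backend is what the gist does; the point is that either route is just SQL over the table you copied.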

When you try this, also take note:

The first time you use bigrquery in a session, it will ask you to authorize bigrquery in the browser. This gives bigrquery the credentials to access data on your behalf. By default, bigrquery picks up httr’s policy of caching per-working-directory credentials in .httr-oauth.

Note that bigrquery requests permission to modify your data; in general, the only data created or modified by bigrquery are the temporary tables created as query results, unless you explicitly modify your own data (say by calling delete_table or insert_upload_job).

Warning: This could cost you some money! Though the limit on free requests is 10,000/day, the limit on free data processed is 1TB/month.  Since GDELT is ~100GB, that means you can run approximately 10 free queries on the complete GDELT database monthly.  Use them well!  (If you need more, additional processing is only \$5 USD per TB.)
Congratulations! In just a few minutes, you have gained access to hyper-fast querying on a hundred-gigabyte dataset right within R.

While it doesn’t look like GDELT has much of a future in the academic community, perhaps this model (being hosted on Google BigQuery) is a valuable way for future massive, machine-coded datasets to be made accessible to the masses (I’m looking at you, Phoenix!). After all, any tool that reduces computation time by 3 orders of magnitude and fits seamlessly into the R environment we all already use is too valuable a prospect to ignore.

## GDELT's Unit of Analysis Problem

In the course of the Event Data community’s reaction to Mona Chalabi’s use of GDELT to analyze events in Nigeria, some of us realized that a fundamental problem with GDELT is that its unit of analysis is unclear. This is a huge no-no.

The unit of analysis in a well-formed dataset is represented by a single row of data.  Some combination of the columns in that row will generate a predictable unique identifier.  For example, international relations data tends to represent countries across years, so we say that the unit of analysis is the country-year.  There should be only one row in such a dataset for Afghanistan in 1990, so we might create a unique identifier called AFG1990 and rest easy in the confidence that it will be unique.
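A toy example makes the idea concrete (the rows and `gdp_growth` values are invented for illustration):

```python
# Country-year rows and the unique identifier they should generate.
rows = [
    {"country": "AFG", "year": 1990, "gdp_growth": -3.2},
    {"country": "AFG", "year": 1991, "gdp_growth": -1.1},
    {"country": "PAK", "year": 1990, "gdp_growth": 4.4},
]

# Build the identifier from the columns that define the unit of analysis.
ids = [f"{r['country']}{r['year']}" for r in rows]

# In a well-formed country-year dataset, every identifier is unique.
assert len(ids) == len(set(ids))
print(ids)  # ['AFG1990', 'AFG1991', 'PAK1990']
```

If that assertion ever fails, the dataset is not actually at the country-year level, which is exactly the ambiguity GDELT suffers from.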

In theory, each row in GDELT represents one discrete event.  GDELT takes many “events” from many articles, aggregates them into groupings it believes represent the same underlying event, and builds the data from those groupings.  The problem is, it does this badly.  We simply cannot trust the mysterious overlord program that builds and updates the data to code an event correctly, figure out that the same event has been reported as a dozen other “events” it has already coded, and cluster them all together.  One of Chalabi’s errors was to take this claim of GDELT’s at face value.

The way we mitigate this problem is by saying that GDELT measures reporting coverage of events.  This is true in a sense: reporting is one of the many filters through which the real, true information must be distilled before it comes to us in raw-data form.  However, this line of thinking seduces us into believing that each GDELT row constitutes a single story about an event.  This is wrong for two reasons.  First, GDELT doesn’t translate one article into one event; it translates every sentence in an article into an event.  Second, it collapses many events that it codes as the same (i.e., same location, same time, same actors, same action) into a single “event.”

But if we cannot trust these datapoints to represent real, meaningful, discrete events, we can at least use GDELT to measure the reporting of events.  How, then, can we recover this mysterious “reporting” variable, which seems to be the only marginally trustworthy product of GDELT?  Well, GDELT provides 58 fields for each row, three of which are “NumMentions”, “NumSources”, and “NumArticles”.  From the GDELT Codebook:

NumMentions – This is the total number of mentions of this event across all source documents. Multiple references to an event within a single document also contribute to this count. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents and the density of events within them vary over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […]

NumSources – This is the total number of information sources containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of sources varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […]

NumArticles – This is the total number of source documents containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […]

If you want to employ GDELT to analyze reporting coverage of something (for example, kidnappings in Nigeria between 1979-01-01 and now), take these steps:

1. Query the data you want from GDELT.
2. For your temporal unit of interest (days at minimum, probably years at maximum), sum any of the three variables listed above.
3. Take a logarithm (any base; it doesn’t really matter which) of your counts.  This will tame the natural exponential growth in reporting over the period GDELT records.  It isn’t the most nuanced handling, but it’s sufficiently effective for a “big picture” analysis.

That will give you a pretty good measure of reporting coverage of your topic of interest.
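Sketched in Python, with invented records standing in for what the step-1 query returns:

```python
# Steps 2 and 3 of the recipe above: sum a mentions field per temporal
# unit, then log the sums. The (year, NumMentions) pairs are illustrative
# stand-ins for rows returned by a real GDELT query.
import math
from collections import defaultdict

records = [
    (1980, 12), (1980, 3), (1981, 40), (1981, 8), (1982, 150),
]

# Step 2: sum NumMentions within each temporal unit (here, years).
totals = defaultdict(int)
for year, mentions in records:
    totals[year] += mentions

# Step 3: take a logarithm of each yearly count.
coverage = {year: math.log(total) for year, total in sorted(totals.items())}
```

The `coverage` series, not the raw counts, is the thing worth plotting.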

## How to Fail at GDELT (with maps)!

Mona Chalabi is at it again with GDELT in Nigeria!  Once again, I have to give her props for digging with both hands into one of the richest data sources in social science today, and once again her crude handling of the data has led her far beyond the realm of reasonable inference into wild conjecture.

Prompted by a reader, Chalabi decided to build a time-series visualization of GDELT on a map.  This is a popular treatment for such data, but inference from such things, as for all things GDELT (and perhaps all things social science), is a slippery business.  Let’s go through this piece-by-piece.

The 25,247 kidnappings that GDELT has recorded taking place in Nigeria since 1982 seem to have occurred across the country.

First thing: Why 1982?  GDELT goes back to 1979.  If there’s some compelling reason here, I don’t see it, but this is the most minor of Chalabi’s infringements.

Second and far more importantly, GDELT doesn’t count kidnappings.  GDELT counts reports.  If two different newspapers report the same thing two different ways, then GDELT will record them as two events.  This can be a subtle difference: “Boko Haram Kidnaps Girls” and “Abubakar Shekau kidnaps Girls”.

This point is so critical that it effectively invalidates any inference Chalabi makes beyond it.  Nevertheless, let’s trudge ahead:

The rapid acceleration in kidnappings in the past decade is obvious, but deciphering regional patterns is harder — especially for those of you who, like me, don’t have a detailed knowledge of Nigerian geography.

Someone who has never touched GDELT before can be forgiven for thinking this.  Chalabi cannot: the last time she used GDELT, she was kindly informed by “the helpful people at GDELT” (read: Kalev Leetaru) that GDELT grows exponentially.  Let me say that again slowly: GDELT. GROWS. EXPONENTIALLY. It says so right in the paper!  This means that anytime you see growth in any phenomenon as recorded by GDELT, it’s probably a product of the fact that GDELT itself grew during that time.  I see no evidence of normalization to account for this fact.  She’s made this error twice now, the second time after being explicitly corrected.  Accepting this and moving on…

I looked at population data and calculated the number of kidnappings per 100,000 residents in each of the country’s 37 states.

This is a somewhat crude calculation.

No kidding.

We’re counting all geolocated kidnappings in the GDELT database since 1982 and dividing that by each state’s current population. Official kidnapping statistics for Nigeria aren’t available, and our numbers do provide a good relative picture; we can see where kidnappings in Nigeria are most prevalent.

Just a reminder that this is incorrect: She’s counting reports of kidnapping, many of which will have significant overlap with one another.

The kidnapping rate is the highest — 120 for every 100,000 people — in the Federal Capital Territory, which includes Abuja, Nigeria’s capital.

I absolutely don’t believe this, and you shouldn’t either.  I can’t prove it, but it looks a lot like the Federal Capital Territory sits very near the centroid of the nation.  If that’s the case, it is the default location for reports in Nigeria that lack more specific geolocation information.  Reports of kidnappings attributed to this spot almost certainly describe events elsewhere in Nigeria, but there’s no way to know where.

Three states in the south also stand out (Rivers, Delta and Bayelsa) with unusually high numbers of kidnappings relative to their population size. One possible explanation is the region’s oil wealth, otherwise known as the curse of the black gold. The United Nations news service has also highlighted how oil extraction in the south of Nigeria has been accompanied by violence and criminality.

Great!  Let’s invoke some controversial literature in development economics because it seems to fit our narrative.  Alternatively, it could just be that crime is more prevalent in urban areas.  Better yet, reporting coverage is better in urban areas.  A kidnapping of one girl in the north might be covered by the local rag (if at all).  A kidnapping in one of the major urban centers might get picked up by the biggest regional or national paper, from which everyone else will pick up the same story.

One other state that was well above the median kidnapping rate (of five kidnappings per 100,000 people) was Borno in the northeast. That’s where the militant group Boko Haram, which is responsible for the recent mass kidnapping, is based. When we filtered the results to look at the first four months of 2014, Borno had the highest kidnapping rate in Nigeria.

That couldn’t possibly have anything to do with the thousands of international news outlets covering the single kidnapping of 300 girls, could it?

Girls were taken from the Government Girls Secondary School in Chibok, which is in Borno state. At the substate level, Chibok has rapidly become more vulnerable to kidnappings; GDELT has recorded 649 kidnappings there in the first four months of this year. GDELT recorded one in 2013, and none before that.

There’s no support for this conclusion.  The very fact that the “trend” she describes moves from 0 to 1 to 649 over the course of three years only suggests some fundamental change in the coverage between those years.  The 649 number also comes out of left field: I’ve heard reports placing the number of kidnapped at around 300.  So Chalabi’s assumption is that, in addition to the 300 all the news is about, around 350 other people were kidnapped from Borno alone.  Wrong!  There were 649 reports of kidnappings in her dataset!  We can make absolutely no inference about the number of kidnappings.  The error Chalabi made in the first sentence has infected every claim in the piece.

I wasn’t the only one annoyed.  Twitter was enraged; Erin Simpson in particular had a lot to say.  She even put a bounty on a Storified version of her rant, which I sadly lost by about five minutes.

## What's the deal with Kansas?

Dylan Matthews over at Vox recently issued a slight correction to an article of his about Pornhub’s look at the politics of its consumers (SFW, believe it or not).

Pornhub’s analysis suggested that porn viewership in Kansas dominated that of all other states at 194 pageviews per capita.  This was a conspicuous outlier, as it bucked the trend that Democratic-leaning states tend to watch more porn than Republican-leaning states.  John Bieler bumped into a similar issue with protests.  Either there’s a field in the middle of nowhere, Kansas, where people go to protest and watch lots of porn, or there’s a bug in the analysis.

When things like this happen in social science, it’s usually because of an error.  Such things are always worth investigating.  If it is an error, then you can fix it and strengthen your analysis.  If it’s not an error, then you might have discovered something truly interesting.

In this case, it was certainly an error.  Which brings us back to Matthews at Vox and the correction. Pornhub analyzed IP-based geolocations.  In other words, it recorded the IP address of everyone who visited, and then ran the IP address through a geocoder.  Want to know where you are?

IP-based geolocation is a type of coarse geolocation: it’s not super-accurate.  Your phone can’t use it to figure out street directions, which is why you have a GPS chip (GPS is called fine geolocation).  There are a bunch of IP addresses assigned to the United States.  Some of those have no regional mapping, meaning that anyone using them could be anywhere in the lower 48 as far as the server is concerned.  The server just records the IP and moves on.  Then, when Pornhub wants to run its analysis, it passes that list of IPs to the geocoder.  When the geocoder gets one of these IPs that isn’t anywhere more specific than “United States”, it returns the centroid of the country.  Where do you think the centroid of the US is?

Now, attributing all this weird stuff to Kansas isn’t the geocoder’s fault.  Geocoding to a centroid in the absence of more information is a good strategy!

Imagine you have to predict a person’s height, but you know nothing about that person except which country they’re from.  Let’s say we know they’re from the United States.  Well, what’s the average height of a person in the United States?  Wikipedia breaks it down by gender, which makes sense: most adult men are a little taller than most adult women.  But we don’t know the person’s gender.  We could take the median of the average height, or we could build a population-weighted mean.  But all of these solutions boil down to the same idea: take an average.

The centroid is the average of all coordinates within a country’s physical boundaries, so it’s arguably the best guess we can provide about an event’s location in the absence of more information.  In a way, all geocoded addresses are centroids: the geocoder takes the most specific geographic identifier it has (country, state, county, city) and returns the centroid of that geographic region.
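A minimal sketch of that fallback behavior, with made-up lookup tables and approximate coordinates (the contiguous-US centroid really does fall in Kansas, near Lebanon):

```python
# Toy geocoder with centroid fallback: return the centroid of the most
# specific region a lookup matched, falling back to the country centroid
# when the IP maps to nothing more specific than "United States".
# Coordinates are approximate and for illustration only.

STATE_CENTROIDS = {"CA": (37.2, -119.3), "NY": (42.9, -75.5)}
US_CENTROID = (39.8, -98.6)  # near Lebanon, Kansas

def geocode(ip_record: dict) -> tuple:
    """Return (lat, lon) for an IP lookup result."""
    state = ip_record.get("state")
    if state in STATE_CENTROIDS:
        return STATE_CENTROIDS[state]
    return US_CENTROID  # no regional mapping: country centroid

print(geocode({"state": "CA"}))   # (37.2, -119.3)
print(geocode({"state": None}))   # (39.8, -98.6) -- i.e., "Kansas"
```

Every unmappable IP in the country piles onto that one Kansas point, which is exactly the artifact in Pornhub’s map.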

If you want to discuss a map with geocoded data on it, you had better be certain you know about that!  Otherwise, you’ll end up correcting your articles after publication.  If you’re working with international geocoded data, Frank at Gothos has made your job easy by putting together a handy dataset with estimates for the centroids of every country.  Go forth, and be wary of claims that Kansas is special!

## Event Data in Popular Media

Mona Chalabi of 538’s Datalab took a stab at reporting on the mass kidnappings executed by Boko Haram in Nigeria.  It’s a very popular story at the moment, and Datalab, 538, and the larger collection of trendy new news entities run by geeks offer a data-driven approach to news and analysis.  But kidnappings, like most things in social science, are exceedingly difficult to find or collect data on.

The modern, popular approach to dealing with things that are hard to measure is to use event data. The idea behind event data is to parse the contents of large quantities of news reports and use machine learning to turn each article’s text into raw data. The most available source of event data at the moment is the Global Database of Events, Language, and Tone, or GDELT.  So it’s only natural that Chalabi should turn to GDELT to address this topic.

Chalabi’s argument is that kidnappings are on the rise in Nigeria.  She offers this graph:

Anyone who has actively worked with GDELT for any reasonable length of time (or read its foundational ISA paper) will recognize the exponential growth in GDELT entries: GDELT itself grows exponentially over time. Jay Ulfelder observed this error on Twitter:

Luckily, someone (probably Kalev Leetaru himself) reached out to Chalabi to try to correct the course of the evidence.

You rarely want to plot raw numbers of events, since global news volume available in digital form has increased exponentially over the past 30 years (it increases by a fraction of a percent each day), meaning the total universe of events recorded in the news and available for processing has increased exponentially over that time. Instead, you want to divide the total number of what you are after (kidnappings in Nigeria) by all GDELT events for a given time period to get a normalized intensity measure.

Chalabi promptly made the modification, leading to this graphic:

Jay Ulfelder still wasn’t convinced:

GDELT is a notoriously noisy source of data. While I agree with Ulfelder that it’s a pretty tall claim that kidnappings are actually on the uptick, I’d like to emphatically add that it isn’t necessarily incorrect. GDELT is a powerful tool, and it may lead to some breakthroughs in the social sciences (its legal and academic challenges notwithstanding).  However, it’s critically important to remember that GDELT’s power lies in showing trends in broad strokes (months, not days) over long periods (decades, not years).  Diluting the counts over time in this way diminishes the effects of a variety of errors, including the one Ulfelder highlighted.  I hope Chalabi takes this lesson forward for future reporting with GDELT.
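The normalization Leetaru describes can be sketched in a few lines (the counts are invented to make the point, not real GDELT numbers):

```python
# Normalized intensity: divide topic-specific counts by the total number
# of GDELT events in the same period, so the database's exponential
# growth doesn't masquerade as a real-world trend.

kidnapping_events = {1980: 10, 1990: 40, 2000: 160, 2010: 640}
all_events = {1980: 1_000, 1990: 4_000, 2000: 16_000, 2010: 64_000}

intensity = {
    year: kidnapping_events[year] / all_events[year]
    for year in kidnapping_events
}

# Raw counts grow 64x over the period, but the normalized intensity is
# flat: the apparent "trend" was just the growth of the database itself.
print(intensity)  # {1980: 0.01, 1990: 0.01, 2000: 0.01, 2010: 0.01}
```

Plotting `intensity` rather than raw counts is the single change Leetaru asked for.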

## The Best Statistical Programming Language is …Javascript?

R-Bloggers has recently been buzzing about Julia, the new kid on the statistical programming block.  Julia, however, is hardly the sole contender for the market of R defectors, with the Clojure library Incanter generating buzz as well.  Even with these two making noise, I think there’s a huge point that everyone is missing, and it’s front-and-center on the Julia homepage:

| benchmark | Julia 3f670da0 | Python 2.7.1 | Matlab R2011a | Octave 3.4 | R 2.14.2 | JavaScript V8 3.6.6.11 |
|---|---|---|---|---|---|---|
| fib | 1.97 | 31.47 | 1336.37 | 2383.80 | 225.23 | 1.55 |
| parse_int | 1.44 | 16.50 | 815.19 | 6454.50 | 337.52 | 2.17 |
| quicksort | 1.49 | 55.84 | 132.71 | 3127.50 | 713.77 | 4.11 |
| mandel | 5.55 | 31.15 | 65.44 | 824.68 | 156.68 | 5.67 |
| pi_sum | 0.74 | 18.03 | 1.08 | 328.33 | 164.69 | 0.75 |
| rand_mat_stat | 3.37 | 39.34 | 11.64 | 54.54 | 22.07 | 8.12 |
| rand_mat_mul | 1.00 | 1.18 | 0.70 | 1.65 | 8.64 | 41.79 |

Julia kicks ass on the benchmarks, but it also faces a severe uphill battle.  It’s new, it’s Linux-centric, it’s command-line-only, and it doesn’t support the wide array of statistical functionality available in R.  But besides the obvious my-language-can-beat-up-your-language comparisons, notice anything interesting?  Think orders of magnitude?

Javascript performs nearly as well as Julia across the board.  This nearly floored me.

## Finding the Global Population's Equilibrium

Last night’s post brought me to a final question about the population model I’ve been tinkering with.  Will my model converge?  There are two ways to answer this question.  The first is kinda-sorta calculus.  To start, we must collapse all the equations from the model into a single function.  We can do this because each equation in the model is contingent only upon other values in the system (none of which are feedback processes except the canonical Births and Deaths), time (represented as the variable n), or the dependent variable, GlobalPop.  So the final function should show the relationship between time and global population.

First, the global population is equal to last year’s population plus the births minus the deaths.  The quintessential gain-loss model.

$$GlobalPop_{n}=GlobalPop_{n-1}*(1+Births-Deaths)$$

Let’s start by unwrapping Births (actually births per capita), which should be the more complicated of the two:

$$Births=FemaleProportionPop*BirthRate$$

$$BirthRate=\frac{4.91-.049n}{MeanLifespan}$$

$$MeanLifespan=54.6+.345n$$

So, if we cram these together:

$$Births=(.497)\frac{4.91-.049n}{54.6+.345n}$$

Now, for deaths: if we remove the GlobalPop variable the way we did for births (it was already factored out in the first equation), deaths (again, deaths per capita) is simply equal to the death rate.  So,

$$Deaths=.018907n^{-0.21}$$

Thus, if we put this all together:

$$GlobalPop_{n}=GlobalPop_{n-1}*(1+\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21})$$

Since we can’t easily calculate GlobalPop into infinity this way (without some heavy-duty computation), let’s just think about the second half of the expression, and measure what happens as n approaches infinity.  If the output (which is basically a global population multiplier) hits 1, then an equilibrium state exists.  If not, then the global population should continue to fluctuate forever.  Let’s see:

$$f(n)=1+\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21}$$

$$\lim_{n \to \infty}f(n)=1+0-0$$

Interesting.  Both Births and Deaths converge on 0, so the equilibrium state exists.  How big is it?  How long will it take?

Equilibrium is the point at which the global population no longer changes from year to year, or if it does, it begins to do so cyclically.  There aren’t any cyclical parameters in the model, and there aren’t enough independent variables for cyclicality to emerge organically, so my suspicion is that we should see a single point of convergence.  In other words, the global population will reach some number and stick to it.  Algebraically, we can represent this as:

$$GlobalPop_{n}=GlobalPop_{n-1}$$

or, since we’re solving for n*, the equilibrium condition, we know GlobalPop at n is equal to the GlobalPop at n-1.  Thus we can divide both sides of the equation by GlobalPop and voila!

$$1=1+\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21}$$

$$0=\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21}$$

$$.018907n^{-0.21}=\frac{.497*(4.91-.049n)}{54.6+.345n}$$

We have now reached a substantively interesting and intuitively obvious point: equilibrium is the state of the population in which the number of annual births equals the number of deaths.  Moving on:

$$.018907n^{-0.21}(54.6+.345n)=.497*(4.91-.049n)$$

Ah, time to pull out the good ol’ fashioned calculator!

$$1.0323222n^{-0.21}+.006523n^{.79}=2.44027-0.024353n$$

$$1.0323222n^{-0.21}+.006523n^{.79}+0.024353n-2.44027=0$$

$$1.0323222-2.44027n^{.21}+.006523n+0.024353n^{1.21}=0$$

Yuck.  This is a sick variation on the old factoring polynomials problem.  Since my Algebra 2 is failing me right now, let’s just calculate this using the old guess ‘n check method a couple of hundred thousand times over.
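The guess-’n-check can also be done by bisection on the per-capita rates exactly as written above. This is only a sketch of that search: the spreadsheet below runs the full year-by-year model with its own inputs, so its crossover year needn’t match this root.

```python
# Find n where per-capita births equal per-capita deaths, using the rate
# equations exactly as written above, via bisection on their difference.

def births(n: float) -> float:
    return 0.497 * (4.91 - 0.049 * n) / (54.6 + 0.345 * n)

def deaths(n: float) -> float:
    return 0.018907 * n ** -0.21

def crossover(lo: float = 1.0, hi: float = 1500.0, tol: float = 1e-9) -> float:
    """Bisect births(n) - deaths(n); assumes one sign change in [lo, hi]."""
    f = lambda n: births(n) - deaths(n)
    assert f(lo) > 0 > f(hi), "bracket must straddle the crossover"
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

n_star = crossover()
```

A few dozen bisection steps replace the “couple of hundred thousand” manual checks.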

| Year | Global Population | Birthrate | Deathrate | Mean Female Lifespan |
|---|---|---|---|---|
| 1960 | 3,027,185,898 | 0.089927 | 0.018907 | 54.6 |
| 1961 | 3,109,658,894 | 0.088470 | 0.016346 | 54.945 |
| 1962 | 3,197,918,219 | 0.087032 | 0.015012 | 55.29 |
| 1963 | 3,289,893,157 | 0.085612 | 0.014132 | 55.635 |
| … | … | … | … | … |
| 2009 | 7,906,471,643 | 0.035088 | 8.31E-03 | 71.505 |
| 2010 | 7,977,393,040 | 0.034238 | 8.28E-03 | 71.85 |
| 2011 | 8,045,839,833 | 0.033396 | 8.25E-03 | 72.195 |
| … | … | … | … | … |
| 3214 | 4,901,896,085 | 4.23E-03 | 4.23E-03 | 487.23 |
| 3215 | 4,891,491,260 | 4.22E-03 | 4.23E-03 | 487.575 |
| 3216 | 4,881,104,717 | 4.22E-03 | 4.22E-03 | 487.92 |

And we’re done!  The global population should converge sometime in the 33rd century.  But that can’t be right: the population continues to change after that year.  In fact, it changes in a very interesting way:

And it all comes together!  The “equilibrium” I was talking about earlier wasn’t an equilibrium at all.  For that to be true, the two variables must be contingent upon each other.  Births and deaths are indifferent to one another (in this model), so their being equal didn’t create an equilibrium; it created a maximum.  That’s the peak you see there: one day, birth rates will have slowed down more than death rates, so that even though people live longer (and by longer I mean absurdly long: in excess of 1,700 years by the end), they procreate so infrequently that the human species dies off, and not especially slowly in the grand scheme of things.

This brings the information we got from the limit approach into sharp relief: if no one is being born and no one is dying, that’s either because we have a bunch of asexual immortals or because there aren’t any humans left.  The model says the latter, for posterity.  Of course, as they say, all models are wrong, and this one doubly so.  However, I’m most interested in the tipping point, not the resolution: if current trends hold, the global population shouldn’t exceed 17 billion, which makes the UN’s high estimate seem very generous.  The whole world must be feeling very amorous to hit 35 billion.

I cannot, however, based on this information, dismiss the low estimate.  In the absence of controls or predictions for cataclysmic events, it seems to me that we should brace for a world of bounded or diminishing population.