How to Set Up a Shiny Server

I recently embarked upon the strange, complicated journey of building a personal Shiny server.  This is how I did it:

 

Step 1: Amazon Web Services Account

Sign up for one.  AWS has a generous free tier: you get enough computing time on EC2 to run a small server continuously every month for a year.  After your year is up, they’ll start charging you for the server, but it isn’t expensive.  (For comparison, I’m running a t2.micro, which costs about $9.50/month running full-time.)

 

Step 2: Provision an EC2 Instance

Once you’ve set up your AWS account (complete with credit card information, uncomfortable as that may be), head over to the EC2 Launch Wizard.  Select the “Ubuntu Server 14.04 LTS (HVM), SSD Volume Type”.  Most of the default configuration settings will be good enough, so just click “Next” through steps 3, 4 and 5.  Step 6 (“Configure Security Groups”) is rather important, so let’s stop there a moment.  It will allow SSH connections by default, but this isn’t permissive enough.  We need to be able to connect using HTTP (so people who visit the server on the web will be able to see things).  Click “Add Rule” and select “HTTP” from the dropdown on the left.  If you want to install RStudio, do this one more time, but leave the left dropdown alone, put 8787 in the “Port Range” box, and select “Anywhere” in the dropdown on the right.  Here’s what this configuration should look like:

[Screenshot: security group rules allowing SSH, HTTP, and port 8787]

Click “Review and Launch.”  Now, when you click “Launch” on the next screen, it doesn’t immediately launch your new virtual machine.  Instead, it brings up a VERY IMPORTANT dialogue.  Here’s what it looks like:

[Screenshot: the key pair dialogue]

Be absolutely certain that you create a new Key Pair, name it, and download it somewhere you’ll be able to find it.  Once you’ve done that, click “Launch Instances” and your instance will start booting up.  Go to the Instances screen to see your server.

 

Step 3: SSH to your new Instance

From the Instances screen, you can see a bunch of information about the server.  The only thing we’re interested in is the Public IP, which should be in one of the rightmost columns.  Copy that IP address so you can paste it into the command below.

Note: If you’re a Windows user, the typical approach to doing this involves PuTTY.

If you’re on an Apple or *nix system, just open a terminal emulator, cd to the directory where you downloaded that secret key (you may need to chmod 400 it first so SSH will accept it), and enter this:

ssh -i RStudio.pem ubuntu@YOUR.PUBLIC.IP

…where YOUR.PUBLIC.IP is the IP address you copied from the beginning of this step. This should give you a welcome message and a command prompt controlling the server.

If you have any problems, you can right-click on your instance in the instances menu and select “Connect”, which will give you much more verbose instructions about how to do this.  Plus a handy Java Applet if you are overcome by despair!

 

Step 4: Install R

From your SSH connection, let’s install R.  Ubuntu provides a version of R, but they don’t keep it up to date, so we want to add a CRAN link to our sources list.

echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/' | sudo tee -a /etc/apt/sources.list

Now, we need to add the security key for R, update the software list, and install R:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
sudo apt-get -y install r-base

 

Step 5 (optional): Install RStudio

You don’t need RStudio to have or run Shiny Server, but it’s incredibly convenient to have both running on the same server.  That way, you have an R console that can edit the files in-place, and changes can be tested as quickly as you can change tabs and refresh.

RStudio maintains the step-by-step guide to installing RStudio, but here’s the short list of commands you need to run:

sudo apt-get -y install gdebi-core libapparmor1
wget http://download2.rstudio.org/rstudio-server-0.98.1103-amd64.deb
sudo gdebi rstudio-server-0.98.1103-amd64.deb

This gets RStudio on your system, but it doesn’t permit logins by default.  That’s because it uses a password for authentication, but the only eligible user on the system (ubuntu) is only configured to log in by private key (remember that RStudio.pem file?).  To create a password for yourself:

sudo passwd ubuntu
> Enter new UNIX password: 
> Retype new UNIX password: 
> passwd: password updated successfully

Remember that password: whenever you go to your RStudio instance, you’ll need it to log in. Which, by the way, you are now able to do.  Open your web browser and surf to http://YOUR.PUBLIC.IP:8787/  and enter username “ubuntu” and that password you just set.  You should now be looking at the glorious RStudio interface, right in your browser!  Take a moment to rejoice.

 

Step 6: Install Shiny Server

Once again, RStudio maintains the step-by-step guide to installing Shiny Server, but here’s the shortlist of requisite commands:

sudo su - -c "R -e \"install.packages(c('shiny','rmarkdown'), repos='http://cran.rstudio.com/')\""
sudo apt-get -y install gdebi-core
wget http://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.3.0.403-amd64.deb
sudo gdebi shiny-server-1.3.0.403-amd64.deb

 

Step 7: Configure Shiny Server

Check out the Administrator’s guide for much more detailed instructions.  Here are the only two config changes I made:

Permissions

When it’s installed, Shiny server creates a new user on the system.  This is all well and good, but it means that you’re going to get weird permissions errors that you can only fix by prepending ‘sudo’ to most of your commands, which is generally bad practice.  Furthermore, R packages are local to users, meaning that any packages you install using RStudio (if you installed RStudio Server) won’t be accessible to your Shiny Applications.  Thus, to fix all this, run these commands:

sudo sed -i 's/run_as shiny;/run_as ubuntu;/g' /etc/shiny-server/shiny-server.conf
sudo chown -R ubuntu /srv/shiny-server/
sudo chgrp -R ubuntu /srv/shiny-server/

Port

When you type in a URL, it doesn’t usually contain a port number.  By default, HTTP uses port 80, so http://aaboyles.com is (usually) equivalent to http://aaboyles.com:80.  Shiny server, however, uses port 3838 by default.  If you want to give out an address without a port for the Shiny Apps you host, you should change the port number in the Shiny Configuration file.  To do that, run this command:

sudo sed -i 's/3838/80/g' /etc/shiny-server/shiny-server.conf
sudo restart shiny-server

You should be able to visit http://YOUR.PUBLIC.IP and see the Shiny Server landing page, complete with two demo apps.

 

Step 8: Load up your Shiny Apps

The easiest way to get Shiny apps is usually to clone them with git, so let’s install git, and cd into the server’s served directory:

sudo apt-get -y install git
cd /srv/shiny-server/

From here, you only need to git clone your shiny apps, and they’ll be publicly available at http://YOUR.PUBLIC.IP/REPO.
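If you don’t have an app handy to clone yet, a minimal two-file app is enough to confirm that everything works.  This is just a sketch (the directory name “hello” is arbitrary): drop the two files into /srv/shiny-server/hello/ and the app will be served at http://YOUR.PUBLIC.IP/hello.

# /srv/shiny-server/hello/ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Hello from Shiny Server"),
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("scatter")
))

# /srv/shiny-server/hello/server.R
library(shiny)
shinyServer(function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), xlab = "x", ylab = "y")
  })
})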

Rejoice! Bask in the glory of your greatness! You have conquered Shiny Server!

Access Event Data in Google BigQuery From R

I spent a good portion of this past spring working with GDELT. Unfortunately, a perfect storm of legal troubles and methodological criticisms turned GDELT from the rising star of Computational Social Science into the ugly behemoth we’d all prefer not to discuss. I stopped updating my copy of GDELT and refocused my efforts on other projects.

So when Google announced that GDELT would be hosted for free on BigQuery, I was only mildly interested (read: didn’t care). GDELT exhibits some ideal characteristics for a demonstration of high-speed data processing: It’s small enough that you could play with it on a well-equipped consumer computer, but large enough (currently ~100GB) that doing so is a terrible idea. Some of the queries I worked with, for example, would occupy hours of computing time on a fairly powerful machine. When I tried some of the same queries on the BigQuery Console, they took ~10 sec, an improvement of three orders of magnitude.

But then I happened to be looking over the dplyr documentation and noticed that it had BigQuery integration, and I became slightly more interested. dplyr wraps the API from bigrquery (one of Hadley‘s other projects). The effect of this is that we can integrate complex operations on exceedingly large data sources into R workflows with little overhead.

Getting bigrquery set up isn’t the most straightforward project, but it isn’t too hard either. From the bigrquery README:

If you just want to play around with the bigquery API, it’s easiest to start with the Google’s free sample data. To do that, you’ll also need to create your own project for billing purposes. If you’re just playing around, it’s unlikely that you’ll go over the 10,000 request/day free limit, but google still needs a project that it can bill (you don’t even need to provide a credit card).

To create a project:

  1. Open https://cloud.google.com/console
  2. Click “Create Project” at the top
  3. Select a name and project ID, and click “Create”
  4. Turn on the BigQuery API by clicking “APIs & Auth” on the left, scrolling down to “BigQuery API”, and clicking the button at the right from “OFF” to “ON”.
  5. Click on “Overview” at the left
  6. Either the Project ID or Project Number at the top can be used to identify your project with bigrquery.

From there, you can access GDELT through the BigQuery Browser Console. Theoretically, you should be able to access it through the API at this point, but I couldn’t get that working. Instead, I just made a copy of the events table and imported it into my own project.  To do that:

  1. Click on the dropdown arrow next to your project’s name in the left pane.  Select “Create a New Dataset”
  2. For simplicity’s sake, just call it “GDELT”
  3. Under “gdelt-bq:full” in the left pane, click the dropdown arrow next to “events” and click on “Copy table”
  4. Copy the table to your project, to the GDELT dataset, and (again, for simplicity’s sake) call it “events”
  5. You can access the data in R and play with it in dplyr.  Copy your project’s billing ID and paste it into the following gist:
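A minimal sketch of that kind of script, using bigrquery’s DBI interface (the project ID below is a placeholder, and the dataset and table names assume the copy you just made):

library(bigrquery)
library(dplyr)

billing <- "your-project-id"   # placeholder: your project's billing ID

# Connect to the copied table: project "your-project-id", dataset "GDELT", table "events"
con <- DBI::dbConnect(
  bigquery(),
  project = billing,
  dataset = "GDELT",
  billing = billing
)

events <- tbl(con, "events")

# Example: count kidnapping events (CAMEO code 181) recorded in Nigeria, by year
events %>%
  filter(EventCode == "181", ActionGeo_CountryCode == "NI") %>%
  group_by(Year) %>%
  summarise(n = n()) %>%
  collect()

The first time you run a query, expect the browser authorization step described below.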

When you try this, also take note:

The first time you use bigrquery in a session, it will ask you to authorize bigrquery in the browser. This gives bigrquery the credentials to access data on your behalf. By default, bigrquery picks up httr’s policy of caching per-working-directory credentials in .httr-oauth.

Note that bigrquery requests permission to modify your data; in general, the only data created or modified by bigrquery are the temporary tables created as query results, unless you explicitly modify your own data (say by calling delete_table or insert_upload_job).

Warning: This could cost you some money! Though the limit on free requests is 10,000/day, the limit on free data processed is 1TB/month.  Since GDELT is 100GB, that means you can run approximately 10 free queries on the complete GDELT database monthly.  Use them well!  (If you need more, each additional TB of processing is only $5 USD.)

Congratulations! In just a few minutes, you have gained access to hyper-fast querying on a hundred-gigabyte dataset right within R.

While it doesn’t look like GDELT has much of a future in the academic community, perhaps this model (being hosted on Google BigQuery) is a valuable way for future massive, machine-coded datasets to be made accessible to the masses (I’m looking at you, Phoenix!). After all, any tool that reduces computation time by 3 orders of magnitude and fits seamlessly into the R environment we all already use is too valuable a prospect to ignore.


GDELT's Unit of Analysis Problem

In the course of the Event Data community’s reaction to Mona Chalabi’s use of GDELT to analyze events in Nigeria, some of us realized that a fundamental problem with GDELT is that its unit of analysis is unclear. This is a huge no-no.

The Unit of Analysis in a well-formed dataset is represented by a single row of data.  Some combination of the columns of that row will generate a predictable unique identifier.  For example, International Relations data tends to represent countries across years.  Hence we say that the unit of analysis is the country-year.  There should only be one row in such a dataset for Afghanistan in 1990, so we might create a unique identifier called AFG1990, and rest easily in the confidence that it will be unique.

In theory, each row in GDELT represents one discrete event.  GDELT takes many “events” from many articles, aggregates them into groupings it believes represent the same event, and builds the data based on this grouping.  The problem is, it does this badly.  We simply cannot trust the mysterious overlord program that builds and updates the data to code an event correctly, figure out that it has already been reported by a dozen other “events” it has coded, and cluster them all together.  One of Chalabi’s errors was to take this claim of GDELT’s at face value.

The way that we mitigate this problem is by saying that GDELT measures reporting coverage of events.  This is true in a sense–reporting is one of the many filters through which the real, true information must be distilled before it comes to us in raw-data form.  However, this line of thinking seduces us into believing that each GDELT row constitutes a single story about an event.  This is wrong for two reasons: Firstly, GDELT doesn’t translate one article into one event.  It translates every sentence in an article into an event. Secondly, it collapses many events that it codes as the same (i.e. same location, same time, same actors, same action) into a single “event.”

But if we cannot trust these datapoints to represent real, meaningful, discrete events, we can at least use GDELT to measure the reporting of events.  How then can we recover this mysterious “reporting” variable which seems to be the only marginally trustworthy product of GDELT?  Well, GDELT provides 58 fields for each row, three of which are “NumMentions”, “NumSources”, and “NumArticles”.  From the GDELT Codebook:

NumMentions – This is the total number of mentions of this event across all source documents. Multiple references to an event within a single document also contribute to this count. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents and the density of events within them vary over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […]

NumSources – This is the total number of information sources containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of sources varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […]

NumArticles – This is the total number of source documents containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […]

If you desire to employ GDELT to analyze reporting coverage of something (for example, kidnappings in Nigeria between 1-1-1979 and now), take these steps:

  1. Query the data you want from GDELT.
  2. For your temporal unit of interest (days at minimum, probably years at maximum), sum any of the three variables listed above.
  3. Take a logarithm (any logarithm–it doesn’t really matter that much) of your counts.  This will smooth out the natural exponential growth in reporting over the time period GDELT records.  This isn’t the most nuanced handling, but it’s sufficiently effective for a “big picture” analysis.

That will give you a pretty good measure of reporting coverage of your topic of interest.
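A minimal sketch of those three steps in R, assuming you have already pulled your query results into a data frame with GDELT’s standard SQLDATE and NumArticles fields:

library(dplyr)

# `kidnappings` is assumed to be the result of step 1: one row per GDELT event,
# with the standard SQLDATE (YYYYMMDD integer) and NumArticles fields.
coverage <- kidnappings %>%
  mutate(year = SQLDATE %/% 10000) %>%        # step 2: temporal unit = year
  group_by(year) %>%
  summarise(articles = sum(NumArticles)) %>%
  mutate(log_articles = log(articles))        # step 3: log to tame exponential growth in reporting

plot(coverage$year, coverage$log_articles, type = "l",
     xlab = "Year", ylab = "log(total NumArticles)")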


How to Fail at GDELT (with maps)!

Mona Chalabi is at it again with GDELT in Nigeria!  Once again, I have to give her props for digging with both hands into one of the richest data sources in social science today, and once again her crude handling of the data has led her far beyond the realms of reasonable inference into wild conjecture.

Prompted by a reader, Chalabi decided to build a time-series visualization of GDELT on a map.  This is a popular treatment for such data, but inference from such things, as for all things GDELT (and perhaps all things social science), is a slippery business.  Let’s go through this piece-by-piece.

The 25,247 kidnappings that GDELT has recorded taking place in Nigeria since 1982 seem to have occurred across the country.

First thing: Why 1982?  GDELT goes back to 1979.  If there’s some compelling reason here, I don’t see it, but this is the most minor of Chalabi’s infringements.

Second and far more importantly, GDELT doesn’t count kidnappings.  GDELT counts reports.  If two different newspapers report the same thing two different ways, then GDELT will record them as two events.  This can be a subtle difference: “Boko Haram Kidnaps Girls” and “Abubakar Shekau kidnaps Girls”.

This point is so critical that it effectively invalidates any inference Chalabi makes beyond it.  Nevertheless, let’s trudge ahead:

The rapid acceleration in kidnappings in the past decade is obvious, but deciphering regional patterns is harder — especially for those of you who, like me, don’t have a detailed knowledge of Nigerian geography.

Someone who has never touched GDELT before can be forgiven for thinking this.  Chalabi cannot: the last time she used GDELT, she was kindly informed by “the helpful people at GDELT” (read: Kalev Leetaru) that GDELT grows exponentially.  Let me say that again slowly: GDELT. GROWS. EXPONENTIALLY. It says so right in the Paper! This means that anytime you see growth in any phenomenon as recorded by GDELT, it’s probably a product of the fact that GDELT itself grew during that time.  I see no evidence of normalization to account for this fact.  She’s made this error twice now, the second time after she was explicitly corrected. Accepting this and moving on…

I looked at population data and calculated the number of kidnappings per 100,000 residents in each of the country’s 37 states.

[Map: kidnapping rate per 100,000 residents by Nigerian state]

This is a somewhat crude calculation.

No kidding.

We’re counting all geolocated kidnappings in the GDELT database since 1982 and dividing that by each state’s current population. Official kidnapping statistics for Nigeria aren’t available, and our numbers do provide a good relative picture; we can see where kidnappings in Nigeria are most prevalent.

Just a reminder that this is incorrect: She’s counting reports of kidnapping, many of which will have significant overlap with one another.

The kidnapping rate is the highest — 120 for every 100,000 people — in the Federal Capital Territory, which includes Abuja, Nigeria’s capital.

I absolutely don’t believe this, and you shouldn’t either.  I don’t know that this is the case, but it looks a lot like the Federal Capital Territory sits very near the centroid of the nation.  If that’s the case, it is the default location for reports in Nigeria without more specific geolocation information.  Reports of kidnappings attributed to this spot almost certainly happened elsewhere within Nigeria, but there’s no way to know where.

Three states in the south also stand out (Rivers, Delta and Bayelsa) with unusually high numbers of kidnappings relative to their population size. One possible explanation is the region’s oil wealth, otherwise known as the curse of the black gold. The United Nations news service has also highlighted how oil extraction in the south of Nigeria has been accompanied by violence and criminality.

Great!  Let’s invoke some controversial literature in developmental economics because it seems to fit our narrative.  Alternatively, it could just be because crime is more prevalent in urban areas.  Better yet, reporting coverage is better in urban areas.  A kidnapping of one girl in the north might be covered by the local rag (if at all).  A kidnapping in one of the major urban centers might get picked up by the biggest regional or national paper, from which everyone else will begin reporting the same story.

One other state that was well above the median kidnapping rate (of five kidnappings per 100,000 people) was Borno in the northeast. That’s where the militant group Boko Haram, which is responsible for the recent mass kidnapping, is based. When we filtered the results to look at the first four months of 2014, Borno had the highest kidnapping rate in Nigeria.

That couldn’t possibly have anything to do with the thousands of international news outlets covering the single kidnapping of 300 girls, could it?

Girls were taken from the Government Girls Secondary School in Chibok, which is in Borno state. At the substate level, Chibok has rapidly become more vulnerable to kidnappings; GDELT has recorded 649 kidnappings there in the first four months of this year. GDELT recorded one in 2013, and none before that.

There’s no support for this conclusion.  The very fact that the “trend” she describes moves from 0 to 1 to 649 over the course of three years only suggests that there’s some fundamental change in the coverage between these years.  The 649 number also comes out of left field: I’ve heard reports placing the number of kidnapped at around 300.  So Chalabi’s assumption is that, in addition to the 300 all the news is about, there are around 350 other people kidnapped from Borno alone.  Wrong!  There were 649 reports of kidnappings in her dataset! We can make absolutely no inference about the number of kidnappings. The error Chalabi made in the first sentence has infected every claim she makes in the piece.

I wasn’t the only one annoyed. Twitter was enraged. Erin Simpson in particular had a lot to say. She even put a bounty on a storified version of her rant, which I sadly lost by about five minutes.


What's the deal with Kansas?

Dylan Matthews over at Vox recently issued a slight correction to an article of his about Pornhub’s look at the politics of its consumers (SFW, believe it or not).

Pornhub’s analysis suggested that porn viewership in Kansas dominated that of all other states at 194 pageviews per capita.  This was a conspicuous outlier, as it bucked the trend that Democratic-leaning states tend to watch more porn than Republican-leaning states.  John Bieler bumped into a similar issue with protests.  Either there’s a field in the middle of Nowhere, Kansas where people go to protest and watch lots of porn, or there’s a bug in the analysis.

When things like this happen in social science, it’s usually because of an error. Such things are always worth investigating.  If it is an error, then you can fix it and strengthen your analysis.  If it’s not an error, then you might have discovered something truly interesting.

In this case, it was certainly an error.  Which brings us back to Matthews at Vox and the correction. Pornhub analyzed IP-based geolocations.  In other words, it recorded the IP address of everyone who visited, and then ran the IP address through a geocoder.  Want to know where you are?

IP-based geolocation is a type of coarse geolocation–it’s not super-accurate.  Your phone can’t use it to figure out street directions, which is why you have a GPS chip.  GPS is called fine geolocation.  There are a bunch of IP addresses that are assigned to the United States.  Some of those have no regional mapping, meaning that anyone using them could be anywhere in the lower 48, so far as the server is concerned.  The server just records the IP and moves on.  Then, when Pornhub wants to run its analysis, it passes that list of IPs to the geocoder.  When the geocoder gets one of these IPs that isn’t anywhere more specific than “United States”, it will return the centroid of the country.  Where do you think the centroid of the US is?

 

Now, attributing all this weird stuff to Kansas isn’t the geocoder’s fault.  Geocoding a centroid in the absence of more information is a good strategy!

Imagine you have to predict a person’s height, but you know nothing about that person except which country they’re from.  Let’s say we know they’re from the United States.  Well, what’s the average height of a person in the United States?  Wikipedia breaks it down by gender, which makes sense: most adult men are a little taller than most adult women.  But we don’t know the person’s gender.  We could take the median of the average height, or we could build a population-weighted mean.  But all of these solutions boil down to the same idea: take an average.

The centroid is the average of all coordinates within a country’s physical boundaries.  So it’s arguably the best guess we can provide about an event’s location in the absence of more information.  In a way, all geocoded addresses are centroids: the geocoder takes the most specific geographic identifier it has (country, state, county, city) and returns the centroid of that geographic region.
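To make the “average of all the coordinates” idea concrete, here is a toy sketch in R (the boundary below is a made-up rectangle, not a real shapefile):

# Toy example: the centroid of a simple rectangular boundary is just the
# mean of its vertex coordinates.
boundary <- data.frame(
  lon = c(-102.0, -94.6, -94.6, -102.0),
  lat = c(  40.0,  40.0,  37.0,   37.0)
)
centroid <- c(lon = mean(boundary$lon), lat = mean(boundary$lat))
centroid   # the middle of the box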

If you want to discuss a map with geocoded data on it, you had better be certain you know about that!  Otherwise, you’ll end up correcting your articles after publication.  If you’re working with international geocoded data, Frank at Gothos has made your job easy by putting together a handy dataset with estimates for the centroids of every country.  Go forth, and be wary of claims that Kansas is special!


Event Data in Popular Media

Mona Chalabi of 538’s Datalab took a stab at reporting on the mass kidnappings executed by Boko Haram in Nigeria.  It’s a very popular story at the moment, and Datalab, 538, and the larger collection of trendy new news entities run by geeks offer a data-driven approach to news and analysis.  Kidnappings, however, like most things in social science, are exceedingly difficult to find or collect data on.

The modern, popular approach to dealing with things which are hard to measure is to use event data. The idea behind event data is to parse the contents of large quantities of news reports and use machine learning to turn each article’s text into a bunch of raw data. The most available source of event data at the moment is called the Global Database of Events, Language, and Tone, or GDELT.  So it’s only natural that Chalabi should refer to GDELT to address this topic.

Chalabi’s argument is that kidnappings are on the rise in Nigeria.  She offers this graph:

Anyone who’s actively worked with GDELT for any reasonable length of time (or read its foundational ISA Paper) will recognize the exponential nature of the growth in GDELT entries.  This occurs because GDELT grows exponentially over time. Jay Ulfelder observed this error on Twitter:

Luckily, someone (probably Kalev Leetaru himself) reached out to Chalabi to try to correct the course of the evidence.

You rarely want to plot raw numbers of events, since global news volume available in digital form has increased exponentially over the past 30 years (it increases by a fraction of a percent each day), meaning the total universe of events recorded in the news and available for processing has increased exponentially over that time. Instead, you want to divide the total number of what you are after (kidnappings in Nigeria) by all GDELT events for a given time period to get a normalized intensity measure.

Chalabi summarily made the modification, leading to this graphic:

Jay Ulfelder still wasn’t convinced:

GDELT is a notoriously noisy source of data. While I agree (with Ulfelder) that it’s a pretty tall claim to make that kidnappings are actually on the uptick, I’d like to emphatically add that it isn’t necessarily incorrect. GDELT is a powerful tool, and it may lead to some breakthroughs in the social sciences (its legal and academic challenges notwithstanding).  However, it’s critically important to remember that GDELT is powerful for its ability to show trends in broad strokes (months, not days) over long periods (decades, not years).  Diluting the counts over time in this way diminishes the effects of a variety of errors, including the one Ulfelder highlighted. I hope Chalabi can take the lesson of this forward for future reporting using GDELT.
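For concreteness, here is a sketch of the normalization quoted above in R, assuming you have already built monthly counts of your events of interest alongside monthly totals for all of GDELT (the data frame and column names are hypothetical):

# `monthly` is a hypothetical data frame with one row per month:
#   month             - e.g. as.Date("2014-01-01")
#   kidnapping_events - GDELT rows matching the query of interest
#   all_events        - all GDELT rows recorded that month
monthly$intensity <- monthly$kidnapping_events / monthly$all_events

plot(monthly$month, monthly$intensity, type = "l",
     xlab = "Month", ylab = "Share of all GDELT events")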


The Best Statistical Programming Language is …Javascript?

R-Bloggers has recently been buzzing about Julia, the new kid on the statistical programming block.  Julia, however, is hardly the sole contender for the market of R defectors, with the Clojure-based library Incanter generating buzz as well.  Even with these two making noise, I think there’s a huge point that everyone is missing, and it’s front-and-center on the Julia homepage:

               Julia      Python   Matlab    Octave    R        JavaScript
               3f670da0   2.7.1    R2011a    3.4       2.14.2   V8 3.6.6.11
fib            1.97       31.47    1336.37   2383.80   225.23   1.55
parse_int      1.44       16.50    815.19    6454.50   337.52   2.17
quicksort      1.49       55.84    132.71    3127.50   713.77   4.11
mandel         5.55       31.15    65.44     824.68    156.68   5.67
pi_sum         0.74       18.03    1.08      328.33    164.69   0.75
rand_mat_stat  3.37       39.34    11.64     54.54     22.07    8.12
rand_mat_mul   1.00       1.18     0.70      1.65      8.64     41.79

Julia kicks ass on the benchmarks, but it also has a severe uphill battle.  It’s new, it’s Linux, it’s command-line-only, and it doesn’t have support for the wide array of statistical functionality available in R.  But besides the obvious my-language-can-beat-up-your-language comparisons, notice anything interesting?  Think orders of magnitude?

JavaScript performs nearly as well as Julia across the board.  This nearly floored me.


Finding the Global Population's Equilibrium

Last night’s post brought me to a final question on the population model with which I’ve been tinkering.  Will my model converge?  There are two ways to answer this question.  The first is kinda sorta calculus.  To start, we must collapse all those equations from the model into a single function.  We can do this because each equation in the model is contingent only upon other values in the system (none of which are feedback processes except the canonical Births and Deaths), time (represented as the variable n), or the dependent variable, GlobalPop.  So the final function should show the relationship between time and Global Population.

First, the global population is equal to last year’s population plus the births minus the deaths.  The quintessential gain-loss model.

\(GlobalPop_{n}=GlobalPop_{n-1}*(1+Births-Deaths)\)

Let’s start by unwrapping Births (actually births per capita), which should be the more complicated of the two:

\(Births=FemaleProportionPop*BirthRate\)

\(BirthRate=\frac{4.91-.049n}{MeanLifespan}\)

\(MeanLifespan=54.6+.345n\)

So, if we cram these together:

\(Births=(.497)\frac{4.91-.049n}{54.6+.345n}\)

Now, for death, if we remove the GlobalPop variable the way we did for births (since it was pre-factored out in equation 1), deaths (again, deaths per capita) are simply equal to the deathrate.  So,

\(Deaths=.018907n^{-0.21}\)

Thus, if we put this all together:

\(GlobalPop_{n}=GlobalPop_{n-1}*(1+\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21})\)

Since we can’t easily calculate GlobalPop into infinity this way (without some heavy-duty computation), let’s just think about the second half of the expression, and measure what happens as n approaches infinity.  If the output (which is basically a global population multiplier) hits 1, then an equilibrium state exists.  If not, then the global population should continue to fluctuate forever.  Let’s see:

\(f(n)=1+\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21}\)

\(\lim_{n \to \infty}f(n)=1+0-0\)

Interesting.  Both Births and Deaths converge on 0, so the equilibrium state exists.  How big is it?  How long will it take?

Equilibrium is the point at which the global population no longer changes from year to year, or if it does, it begins to do so cyclically.  There aren’t any cyclical parameters in the model, and there aren’t enough independent variables for cyclicality to emerge organically, so my suspicion is that we should see a single point of convergence.  In other words, the global population will reach some number and stick to it.  Algebraically, we can represent this as:

\(GlobalPop_{n}=GlobalPop_{n-1}\)

or, since we’re solving for n*, the equilibrium condition, we know GlobalPop at n is equal to the GlobalPop at n-1.  Thus we can divide both sides of the equation by GlobalPop and voila!

\(1=1+\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21}\)

\(0=\frac{.497*(4.91-.049n)}{54.6+.345n}-.018907n^{-0.21}\)

\(.018907n^{-0.21}=\frac{.497*(4.91-.049n)}{54.6+.345n}\)

We have now reached a substantively interesting and intuitively obvious point: equilibrium is the state of the population in which the number of annual births equals the number of deaths.  Moving on:

\(.018907n^{-0.21}(54.6+.345n)=.497*(4.91-.049n)\)

Ah, time to pull out the good ol’ fashioned calculator!

\(1.0323222n^{-0.21}+.006523n^{.79}=2.44027-0.024353n\)

\(1.0323222n^{-0.21}+.006523n^{.79}+0.024353n-2.44027=0\)

\(1.0323222-2.44027n^{.21}+.006523n+0.024353n^{1.21}=0\)

Yuck.  This is a sick variation on the old factoring polynomials problem.  Since my Algebra 2 is failing me right now, let’s just calculate this using the old guess ‘n check method a couple of hundred thousand times over.

Years  Global Population   Birthrate   Deathrate   Mean Female Lifespan
1960    3,027,185,898       0.089927    0.018907    54.6
1961    3,109,658,894       0.088470    0.016346    54.945
1962    3,197,918,219       0.087032    0.015012    55.29
1963    3,289,893,157       0.085612    0.014132    55.635
...
2009    7,906,471,643       0.035088    8.31E-03    71.505
2010    7,977,393,040       0.034238    8.28E-03    71.85
2011    8,045,839,833       0.033396    8.25E-03    72.195
...
3214    4,901,896,085       4.23E-03    4.23E-03    487.23
3215    4,891,491,260       4.22E-03    4.23E-03    487.575
3216    4,881,104,717       4.22E-03    4.22E-03    487.92

And we’re done!  The global population should converge sometime in the 33rd century.  But that can’t be right: the population continues to change after that year.  In fact, it changes in a very interesting way:

[Chart: the model’s very-long-term population trajectory]

And it all comes together!  The “Equilibrium” I was talking about earlier wasn’t an equilibrium at all.  In order for that to be true, the two variables must be contingent upon each other.  Births and deaths are indifferent to one another (in this model), so being equal didn’t create an equilibrium–it created a maximum.  That’s the peak you see there: one day, birthrates will have slowed down more than death rates, so that even though people live longer (and by longer I mean absurdly long–in excess of 1700 years by the end), they procreate so infrequently that the human species dies off, and not especially slowly in the grand scheme.

This brings that information we got from the limit approach into sharp relief: If no one is being born, and no one is dying, that’s either because we have a bunch of asexual immortals, or there aren’t any humans left.  The model says the latter, for posterity.  Of course, as they say, all models are wrong, and this one doubly so.  However, I’m most interested in the tipping point, and not the resolution: if current trends are maintained, the Global population shouldn’t exceed 17 Billion, which makes the UN’s high estimate seem very generous.  The whole world must be feeling very amorous to hit 35 Billion.

I cannot, however, based on this information, dismiss the low estimate.  In the absence of control or predictions for cataclysmic events, it seems to me that we should be braced for a world of bounded or diminishing population.


Measuring Global Population using System Dynamics

For my zombie population dynamics paper last fall, I decided to induce the global effects of an outbreak of zombiism by starting with an empirically accurate global population model.  Looking back, that baseline model is a really cool way to demonstrate system dynamic modeling.  System dynamic modeling is a process by which we define the causal relationships between variables using logical and mathematical functions, and then use those in turn to forecast future effects.  This sort of modeling works very well with anything that can be defined in terms of population counts.  The canonical example is the Lotka-Volterra predator-prey model.  In this model, a predator population thrives on a population of prey.  As one population grows, the other shrinks, and vice versa.

The global population model is a bit simpler since we’re dealing with a single population (that is, humans).  We’ll call this a gain-loss model.  Gains are defined by births, and losses by deaths.  After all, when modeling the entirety of humanity, there are only these two ways to change the global population (i.e. humanity’s total population count is unaffected by migration, divine ascension, etc…).  Let’s take a look at that visually:

The Box at the center represents the entire human population.  The “flow” to the left of the box indicates births increasing the number of humans in the population, and the one to the right is deaths decreasing the total population.  The two “converters” (circles at the top left and right) define the rates at which births and deaths occur.  This is a cool setup because we can define any of these as numerical values or as functions.  For example:

\(Births=GlobalPop*BirthRate\)

and likewise with the deaths.  Thus,

\(GlobalPop_{n+1}=GlobalPop_n(1+BirthRate_n-DeathRate_n)\)

Now, in order to define these numerical values, we must consult some statistics.  For this, there are the World Development Indicators.  I parsed the very same into separate datasets not so long ago, but this time it’ll be easier to just use the graphs.  For that, Google has set up a really nifty Public Data Explorer.  To start with, let’s look at Birthrate.  The graph looks pretty nonlinear, but the take-home point is that it’s trending downward.  As such, we can represent it linearly with the knowledge that this may be something to revisit later on in the model construction.  Over the 49-year period from 1960 to 2009, the value drops from 4.91 to 2.52.  That’s a change of 2.39, or a mean change of .049 annually.  But what does this number actually mean?

“The average number of births per woman.”

Crap.  This creates two new problems.  The first is pretty clear: We will know how many people there are at any given time, but not how many women.  We’ll have to include another converter to capture this fact.  The second is much more subtle: this is evaluated over steps of time (in this case, years).  But the rate is measured over a woman’s lifespan.  So we also must include a variable which represents the mean female lifespan.  All told, here’s our model:

where:

\(MeanLifespan=54.6+.345n\)

\(Birthrate=(4.91-.049n)/MeanLifespan\)

The numerical values are derived from time-series regressions on the data for female life expectancy and fertility rate.  As for the percent of the population which is female, it looks to be roughly constant around 49.7%, so a fixed value should be good enough.

Now, all that remains is to define the Deathrate and test the model!  Deathrate presents a new problem–a truly nonlinear relationship.  As such, we’ll divide the WDI figure (reported as deaths per 1,000 people) by 1000 to give us a per capita probability of death per year, and fit the prediction curve using exponential regression, giving us this equation:

\(Deathrate=.018907*n^{-0.21}\)

This curve actually gives a strikingly good fit to the data at hand:

With all of our values set and relationships defined, it’s high time we tested this puppy out!  We’ll set the initial population equal to the population in 1960, approximately 3 billion.  We should see population growth up to today’s estimate of 6.775 billion.  Let’s check it out, shall we?
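Here’s a minimal sketch of that test in R, iterating the birth and death equations above (indexing n as years since 1960 is my assumption; the original model was built in a system dynamics tool rather than R):

# Gain-loss iteration: GlobalPop[n+1] = GlobalPop[n] * (1 + BirthRate[n] - DeathRate[n])
birth_rate <- function(n) 0.497 * (4.91 - 0.049 * n) / (54.6 + 0.345 * n)  # per capita, incl. female share
death_rate <- function(n) 0.018907 * n^-0.21                               # per capita

years <- 1960:2009
pop <- numeric(length(years))
pop[1] <- 3e9                                  # approximately 3 billion in 1960
for (i in 2:length(years)) {
  n <- i - 1                                   # years since 1960
  pop[i] <- pop[i - 1] * (1 + birth_rate(n) - death_rate(n))
}
tail(data.frame(year = years, pop = pop))      # compare against the WDI series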

Not too shabby.  Hardly perfect, but considering everything that can go wrong, I think we did pretty OK!  Two dangling observations: First, we started out modeling a phenomenon which is strikingly linear.  Second, our prediction is slightly nonlinear.  Slightly, but conspicuously.  System Dynamic modeling is excellent at representing nonlinear phenomena, so it’s a little gratuitous to use it in a linear case.  Nevertheless, our model performs very well:

We can see here an R-squared of .98, indicating a high degree of accuracy.  The beta is pretty good as well–if it were perfect, we would see a beta of exactly 1.  However, if we compare this to a linear regression on the time series data, we look a little worse:

Not so good.  Oh well.  There are a few sources of this error:

  1. The life expectancy is latent–it describes the life expectancy of people born in that year, not the life expectancy of people living in that year.
  2. We didn’t describe the female percentage of the population with a function, which makes it entirely inflexible.
  3. The pseudo-linear models we use to generate many of the coefficients are imperfect (though admittedly very good in this case).
  4. The data are imperfect: the World Development Indicators are calculated using a completely different methodology, which one hopes would be fairly accurate.  Even so, it would be logistically impossible to count every single person on the planet at a fixed point in time–people are born and die every minute.  The WDIs are correct to the ballpark (as are the models we devise from them), but only so.

Even so, I claim we’ve accomplished something moderately cool.  We’ve set up a complex equation set which predicts the global population, controlling for life expectancy, fertility, proportion of the population which is female, and per capita deathrate.  Our model could survive an exogenous shock to any of these, while the linear model is left scratching its head.  System Dynamic modeling is cool because it is necessarily representative of the causal linkages between variables, thereby creating much more compelling forecasts.  This model hinges upon the irrefutable causality between births, deaths, and population size.  Maybe next time we can try something a little more debatable, and shed some light on something new.


Probability of World Wars

The Pew Research Center performed a survey polling people about the fate of the world in 2050. The results are fascinating, but there’s one in particular I’d like to pull apart. 58% of people believe there will likely be another world war between now and then.

Here’s the thing, though: We can calculate the probability that such a war will occur within this time frame.  Back in the ’40s, a meteorologist named Lewis Fry Richardson figured out that armed conflict follows a power-law distribution.  The bigger the war, the less likely such a war will ever occur.  Thus we can calculate the probability that some random war meets the scale of a world war (more than 10 million casualties).  Harvard’s Lars Erik Cederman replicated this result in a paper in which he used the Correlates of War data for an even more accurate fit than Richardson had the resources to even dream about.  He generated this graph to describe it.

[Figure: Cederman’s log-log plot of war size versus probability, from which the equation below is taken]

So, let’s assume that the term “World War” means a war in which the scale of casualties is measured in the tens of millions, like the prior two world wars.  That means that for the purposes of the above equation, \(\log s \approx 7\).  Plugging this into Cederman’s Model:

\(\log P(S>s)=1.27-.41 \log s\)

\(\log P(S>s)=1.27-.41(7)\)

\(\log P(S>s)=-1.6\)

\(P(S>s) \approx 0.025\)

Thus the probability that any random war will reach the scale of a World War is .025 (or 2.5%).  OK, so what about those random wars?  The Correlates of War project lists 86 interstate conflicts that had 1000 or more fatalities, over the course of 185 years (1816-2001).  That’s a rate of approximately \(86/185 \approx .465\) wars per year.  So we should expect \(.465*40 \approx 19\) wars.  Now, what is the probability that one of those will be World War III?

This is an interesting variation on the Birthday Problem.  We can’t calculate the probability that the big war is in our upcoming wars directly.  However, we can calculate the probability that our set of predicted wars does not contain WW3.  Here’s how it works:  The probability of the first war being the big war is approximately 2.5%.  So the probability of the second war being the big war is 2.5%.  And so on.  Now, the probability that the first war or the second war is the big war is not 2.5% + 2.5% (as is the intuition).  If that were the case, we would see a world war within 41 wars with probability greater than 1, and that can’t be true.  However, the probability that neither the first war nor the second war is the big war is \(.975*.975=.951\).  So, a generic equation for the probability that a World War does not exist within n wars would be

\(P(\text{World War not in } n \text{ wars})=.975^n\)

Now, from here it’s just a small logical step to the probability that a World war occurs within n wars.  It either doesn’t happen (the probability we just calculated), or it does (the probability that’s left).  So a generic equation for the probability that some war in n wars is a World War is

\(P(\text{World War in } n \text{ wars})=1-(.975^n)\)

We can use this equation straight up to figure out if our estimate of 19 wars within the next 40 years contains a world war.  Just set n = 19.  So there’s a \(1-(.975^{19})=.382\) chance we will see an earth-shattering world war in any given 40-year span, the next 40 years included.  Personally, I’m of the camp that thinks it won’t happen.  But then, going with your gut is just punditry.  Better to say I’m 61.8% certain it won’t happen.  58% of the population is probably wrong.
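The whole calculation fits in a few lines of R; this is just a sketch of the arithmetic above:

# P(a random war is WW3-scale), from Cederman's fit: log10 P(S > s) = 1.27 - 0.41 * log10 s
p_big  <- 10^(1.27 - 0.41 * 7)       # log10 s ~ 7: tens of millions of casualties
n_wars <- round(86 / 185 * 40)       # expected number of interstate wars over 40 years
p_ww3  <- 1 - (1 - p_big)^n_wars     # complement of "none of the n wars is WW3-scale"
c(p_big = p_big, n_wars = n_wars, p_ww3 = p_ww3)
# roughly 0.025, 19, and 0.38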

References:

  1. Cederman, L. E. (2003). Modelling the size of wars: from billiard balls to sandpiles. American Political Science Review, 97, 135–150.
  2. Richardson, L.F. (1960). Statistics of deadly quarrels. Pacific Grove, CA: Boxwood Press.