Hadley Wickham has been on a Data Formats tear lately.  Since the beginning of the year, he’s reengineered the way R ingests data from Flat files, Excel spreadsheets, and other statistics packages.  If you include the fundamental ways in which dplyr alters database querying, Hadley’s basically rewritten how R ingests any data.  Here’s a quick once-over on how to use these.

But first things first. Like much of the Hadleyverse, it’s usually best to get the latest version from Github rather than use the CRAN version.  So we’re going to need devtools:

if(!require("devtools")){
install.packages("devtools")
library("devtools")
}

To read flat files, Hadley wrote a new package called readr. It’s a very common-sense library.  To load a file of a certain type, the command is probably read_filetype(“path/to/file”).  Notice the underscore in place of base R’s period in read.csv function.  But to get started, let’s write a CSV file from a sample data.frame:

mtcars_path <- tempfile(fileext = ".csv")
write_csv(mtcars, mtcars_path)

To create a data.frame from a flat file, there are a six functions for six different use cases

In addition to data.frames, readr also provides a means to ingesting arbitrary text files as less-structured data.  For example, it can read a file line-wise using read_lines(), generating a vector of strings.

It can also ingest a complete text file into a single string using read_file().

Finally, it provides a means of parsing strings in the columns of existing data frames using type_convert()

df <- data.frame(
x = as.character(runif(10)),
y = as.character(sample(10)),
stringsAsFactors = FALSE
)
str(df)
str(type_convert(df))

Hadley wrote the readxl package to make it easy to get data out of Excel and into R. The simple trick behind this was to require no external dependencies, unlike the pre-existing solutions.

read_excel reads both xls and xlsx files.  Just pass it the path of the file you want to load.  Here’s a sample Excel file containing the the iris, mtcars, chickwts, and quakes sample datasets, each in a different sheet of the same file.

# Specify sheet with a number or name
• Re-encodes non-ASCII characters to UTF-8.
• Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.
• Blank columns and rows are automatically dropped.
• It returns data frames with additional tbl_df class, so if you have dplyr loaded, you get nicer printing.

The package includes an example file created with openxlsx:

## From Other Statistics Environments: haven

Haven allows you to load foreign data formats (SAS, Spss and Stata) in to R. It is simply an R wrapper for Evan Miller’s ReadStat library. Haven offers similar functionality to the base foreign package but:

• Can read SAS’s proprietary binary format (SAS7BDAT). The one other package on CRAN that does that, sas7bdat, was created to document the reverse-engineering effort. Thus its implementation is designed for experimentation, rather than efficiency. Haven is significantly faster and should also support a wider range of SAS files, and works with SAS7BCAT files.
• It can be faster. Some spss files seem to load about 4x faster, but others load slower. If you have a lot of SPSS files to import, you might want to benchmark both and pick the fastest.
• Works with Stata 13 files (foreign only works up to Stata 12).
• Can also write SPSS and Stata files (This is hard to test so if you run into any problems, please let me know).
• Can only read the data from the most common statistical packages (SAS, Stata and SPSS).
• You always get a data frame, date times are converted to corresponding R classes and labelled vectors are returned as new labelled class. You can easily coerce to factors or replace labelled values with missings as appropriate. If you also use dplyr, you’ll notice that large data frames are printed in a convenient way.
• Uses underscores instead of dots 😉

## From Databases: dplyr

While dplyr is the oldest package I’m going to discuss here, I haven’t seen much adoption of the functionality in which I’m interested here.  dplyr generically wraps databases!  This means that instead of starting with a data.frame or data.table, you can start a dplyr chain with a database connection.  From there you construct your data pipeline using dplyr verbs, and dplyr will opaquely translate them into raw SQL, with which it then queries the database.  This is genius: it means that instead of using R to reinvent the most efficient strategy for outputting the exact data desired, it outsources that task to the database engine (which is already highly optimized for this task).

my_db <- src_sqlite("my_db.sqlite3", create = T)
cars_sqlite <- copy_to(my_db, mtcars, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
flights_sqlite <- tbl(cars_sqlite, "flights")

c4 <- flights_sqlite %>%
filter(year == 2013, month == 1, day == 1) %>%
select(c1, year, month, day, carrier, dep_delay, air_time, distance) %>%
mutate(c2, speed = distance / air_time * 60) %>%
arrange(c3, year, month, day, carrier)

## Step 2: Provision an EC2 Instance

Once you’ve set up your AWS account (complete with credit card information, uncomfortable as that may be), head over to the EC2 Launch Wizard.  Select the “Ubuntu Server 14.04 LTS (HVM), SSD Volume Type”.  Most of the default configuration settings will be good enough, so just click “Next” through steps 3, 4 and 5.  Step 6 (“Configure Security Groups”) is rather important, so let’s stop there a moment.  It will allow SSH connections by default, but this isn’t permissive enough.  We need to be able to connect using HTTP (so people who visit the server on the web will be able to see things).  Click “Add Rule” and select “HTTP” from the dropdown on the left.  If you want to install RStudio, do this one more time, only leave the left dropdown alone and put 8787 in the “Port Range” box, and “Anywhere” on the dropdown on the right.  Here’s what this configuration should look like:

Click “Review and Launch.”  Now, when you click “Launch” on the next screen, it doesn’t immediately launch your new virtual machine.  Instead, it brings up a VERY IMPORTANT dialogue.  Here’s what it looks like:

Be absolutely certain that you create a new Key Pair, name it, and download it somewhere you’ll be able to find it.  Once you’ve done that, you click “Launch Instances” and your instance will start booting up.  Go to the Instances Screen to see your server.

## Step 3: SSH to your new Instance

From the instances screen, you can see a bunch of information about the server.  The only thing about which we’re interested is the Public IP, which should be in one of the rightwardly columns.  Copy that IP Address so you can paste it into the command below.

Note: If you’re a Windows user, the most typical approach to doing this usually involves PuTTY.

If you’re on an Apple or *nix system, just open a terminal emulator, cd to the directory where you downloaded that secret key, and enter this:

ssh -i RStudio.pem ubuntu@YOUR.PUBLIC.IP

…where YOUR.PUBLIC.IP is the IP address you copied from the beginning of this step. This should give you a welcome message and a command prompt controlling the server.

If you have any problems, you can right-click on your instance in the instances menu and select “Connect”, which will give you much more verbose instructions about how to do this.  Plus a handy Java Applet if you are overcome by despair!

## Step 4: Install R

From your SSH connection, let’s install R.  Ubuntu provides a version of R, but they don’t keep it up to date, so we want to add a CRAN link to our sources list.

sudo echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/' >> /etc/apt/sources.list

Now, we need to add the security key for R, update the software list, and install R:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
sudo apt-get -y install r-cran

## Step 5 (optional): Install RStudio

You don’t need RStudio to have or run Shiny Server, but it’s incredibly convenient to have both running on the same server.  That way, you have an R console that can edit the files in-place, and changes can be tested as quickly as you can change tabs and refresh.

RStudio maintains the step-by-step guide to installing RStudio, but here’s the short list of commands you need to run:

sudo apt-get -y install gdebi-core libapparmor1
sudo gdebi rstudio-server-0.98.1103-amd64.deb

This gets RStudio on your system, but it doesn’t permit logins by default.  That’s because it uses a password for authentication, but the only eligible user on the system (ubuntu) is only configured to login by private key (remember that RStudio.pem file?).  To create a password for yourself:

sudo passwd ubuntu

Remember that password: whenever you go to your RStudio instance, you’ll need it to login. Which, by the way, you are now able to do.  Open your web browser and surf to http://YOUR.PUBLIC.IP:8787/  and enter username “ubuntu” and that password you just set.  You should now be looking at the glorious RStudio interface, right in your browser!  Take a moment to rejoice.

## Step 6: Install Shiny Server

Once again, RStudio maintains the step-by-step guide to installing Shiny Server, but here’s the shortlist of requisite commands:

sudo su - -c "R -e \"install.packages(c('shiny','rmarkdown'), repos='http://cran.rstudio.com/')\""
sudo apt-get -y install gdebi-core
sudo gdebi shiny-server-1.3.0.403-amd64.deb

## Step 7: Configure Shiny Server

Check out the Administrator’s guide for much more detailed instructions.  Here are the only two config changes I made:

### Permissions

When it’s installed, Shiny server creates a new user on the system.  This is all well and good, but it means that you’re going to get weird permissions errors that you can only fix by prepending ‘sudo’ to most of your commands, which is generally bad practice.  Furthermore, R packages are local to users, meaning that any packages you install using RStudio (if you installed RStudio Server) won’t be accessible to your Shiny Applications.  Thus, to fix all this, run these commands:

sudo sed -i 's/run_as shiny;/run_as ubuntu;/g' /etc/shiny-server/shiny-server.conf
sudo chown -R ubuntu /srv/shiny-server/
sudo chgrp -R ubuntu /srv/shiny-server/

### Port

When you type in a URL, it doesn’t usually contain a port number.  By default, HTTP uses port 80, so http://aaboyles.com is (usually) equivalent to http://aaboyles.com:80.  Shiny server, however, uses port 3838 by default.  If you want to give out an address without a port for the Shiny Apps you host, you should change the port number in the Shiny Configuration file.  To do that, run this command:

sudo sed -i 's/3838/80/g' /etc/shiny-server/shiny-server.conf
sudo shiny-server restart

You should be able to visit http://YOUR.PUBLIC.IP and see the Shiny Server landing page, complete with two demo apps.

The easiest way to get Shiny apps is usually to clone them with git, so let’s install git, and cd into the server’s served directory:

sudo apt-get -y install git
cd /srv/shiny-server/

From here, you only need to git clone your shiny apps, and they’ll be publically available at http://YOUR.PUBLIC.IP/REPO.

Rejoice! Bask in the glory of your greatness! You have conquered Shiny Server!

### Access Event Data in Google BigQuery From R

I spent a good portion of this past spring working with GDELT. Unfortunately, a perfect storm of legal troubles and methodological criticisms this spring turned GDELT from the rising star of Computational Social Science into the ugly behemoth we’d all prefer not to discuss. I stopped updating my copy of GDELT and refocused my efforts to other projects.

So when Google Announced that GDELT would be hosted for free on BigQuery, I was only mildly interested (read: didn’t care). GDELT exhibits some ideal characteristics for a demonstration of high-speed data processing: It’s small enough that you could play with it on a well-equipped consumer computer, but large enough (currently ~100GB) that doing so is a terrible idea. Some of the queries I worked with, for example, would occupy hours of computing time on a fairly powerful machine. When I tried the some of the same queries on the BigQuery Console, it took ~10 sec, an improvement of three orders of magnitude.

But then I happened to be looking over the dplyr documentation and noticed that it had BigQuery integration, and I became slightly more interested. dplyr wraps the API from BigRQuery (one of Hadley‘s other projects). The effect of this is that we can integrate complex operations exceedingly large data sources into R workflows with little overhead.

Getting BigRQuery set up isn’t the most straight-forward project, but it isn’t too hard either. From the BigRQuery README:

If you just want to play around with the bigquery API, it’s easiest to start with the Google’s free sample data. To do that, you’ll also need to create your own project for billing purposes. If you’re just playing around, it’s unlikely that you’ll go over the 10,000 request/day free limit, but google still needs a project that it can bill (you don’t even need to provide a credit card).

To create a project:

2. Click “Create Project” at the top
3. Select a name and project ID, and click “Create”
4. Turn on the BigQuery API by clicking “APIs & Auth” on the left, scrolling down to “BigQuery API”, and clicking the button at the right from “OFF” to “ON”.
5. Click on “Overview” at the left
6. Either the Project ID or Project Number at the top can be used to identify your project withbigrquery.

From there, you can access GDELT through the BigQuery Browser Console. Theoretically, you should be able to access it through the API at this point, but I couldn’t get that working. Instead, I just made a copy of the events table and imported it into my own project.  To do that:

1. Click on the dropdown arrow next to your project’s name in the left pane.  Select “Create a New Dataset”
2. For Simplicity’s sake, just call it “GDELT”
3. Under “gdelt-bq:full” in the left pane, click the dropdown arrow next to “events” and click on “Copy table”
4. Copy the table to your project, to the GDELT dataset, and (again, for simplicity’s sake) call it “events”
5. You can access the data in R and play with it in dplyr.  Copy your project’s billing ID and paste it into the following gist:

When you try this, also take note:

The first time you use bigrquery in a session, it will ask you to authorize bigrquery in the browser. This gives bigrquery the credentials to access data on your behalf. By default, bigrquery picks up httr’s policy of caching per-working-directory credentials in .httr-oauth.

Note that bigrquery requests permission to modify your data; in general, the only data created or modified by bigrquery are the temporary tables created as query results, unless you explicitly modify your own data (say by calling delete_table or insert_upload_job).

Warning: This could cost you some money! Though the limit on free requests is 10,000/day, the limit on free data processed is 1TB/month.  Since GDELT is 100GB, that means you can run approximately 10 free queries on the complete GDELT database monthly.  Use them well!  (If you need more, 5TB is only $5 USD). Congratulations! In just a few minutes, you have gained access to hyper-fast querying on a hundred-gigabyte dataset right within R. While it doesn’t look like GDELT has much of a future in the academic community, perhaps this model (being hosted on Google BigQuery) is a valuable way for future massive, machine-coded datasets to be made accessible to the masses (I’m looking at you, Phoenix!). After all, any tool that reduces computation time by 3 orders of magnitude and fits seamlessly into the R environment we all already use is too valuable a prospect to ignore. ### GDELT's Unit of Analysis Problem In the course of the Event Data community’s reaction to Mona Chalabi’s use of GDELT to analyze events in Nigeria, some of us realized that a fundamental problem with GDELT is that its unit of analysis is unclear. This is a huge no-no. The Unit of Analysis in a well-formed dataset is represented by a single row of data. Some combination of the columns of that row will generate a predictable unique identifier. For example, International Relations data tends to represent countries across years. Hence we say that the Unit of analysis is the Country-year. The should only be one row in such a dataset for Afghanistan in 1990, so we might create a unique identifier called AFG1990, and rest easily in the confidence that it will be unique. Each row in GDELT represents one discrete event in theory. GDELT takes many “events” from many articles, aggregates them into a groupings it believes represent the same event, and builds the data based on this grouping. The problem is, it does this badly. We simply cannot trust the mysterious overlord program that builds and updates the data to actually code an event correctly and then figure out it reported by dozen other “events” it has already coded and cluster them together. One of Chalabi’s errors was to trust this claim of GDELT’s at face value. The way that we mitigate this problem is by saying that GDELT measures reporting coverage of events. This is true in a sense–reporting is one of the many filters through which the real, true information must be distilled before it comes to us in raw-data form. However, this line of thinking seduces us into believing that each GDELT row constitutes a single story about an event. This is wrong for two reasons: Firstly, GDELT doesn’t translate one article into one event. It translates every sentence in an article into an event. Secondly, it collapses many events that it codes as the same (i.e. same location, same time, same actors, same action) into a single “event.” But if we cannot trust these datapoints to represent real, meaningful, discrete events, we can at least use GDELT to measure the reporting of events. How then can we recover this mysterious “reporting” variable which seems to be the only marginally trustworthy product of GDELT? Well, GDELT provides 58 fields for each row, two of which are “NumMentions”, “NumSources”, and “NumArticles”. From the GDELT Codebook: NumMentions – This is the total number of mentions of this event across all source documents. Multiple references to an event within a single document also contribute to this count. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents and the density of events within them vary over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […] NumSources – This is the total number of information sources containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of sources varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […] NumArticles – This is the total number of source documents containing one or more mentions of this event. This can be used as a method of assessing the “importance” of an event: the more discussion of that event, the more likely it is to be significant. The total universe of source documents varies over time, so it is recommended that this field be normalized by the average or other measure of the universe of events during the time period of interest. […] If you desire to employ GDELT to analyze reporting covering of something (for example, Kidnappings in Nigeria between 1-1-1979 and Now), take these steps: 1. Query the data you want from GDELT. 2. For your temporal unit of interest (days at minimum, probably years at maximum), sum any of the three variables listed above. 3. Take a logarithm (any logarithm–it doesn’t really matter that much) of your counts. This will smooth out the natural exponential growth in reporting over the time period GDELT records. This isn’t the most nuanced handling, but it’s sufficiently effective for a “big picture” analysis. That will give you a pretty good measure of reporting coverage of your topic of interest. ### How to Fail at GDELT (with maps)! Mona Chalabi is at it again with GDELT in Nigeria! Once again, I have to give her props for digging with both hands into one of the richest data sources in social science today, and once again her crude handling of the data has lead her far beyond the realms of reasonable inference into wild conjecture. Prompted by a reader, Chalabi decided to build a time-series visualization of GDELT on a map. This is a popular treatment for such data, but inference from such things, as for all things GDELT (and perhaps all things social science), is a slippery business. Let’s go through this piece-by-piece. The 25,247 kidnappings that GDELT has recorded taking place in Nigeria since 1982 seem to have occurred across the country. First thing: Why 1982? GDELT goes back to 1979. If there’s some compelling reason here, I don’t see it, but this is the most minor of Chalabi’s infringements. Second and far more importantly, GDELT doesn’t count kidnappings. GDELT counts reports. If two different newspapers report the same thing two different ways, then GDELT will record them as two events. This can be a subtle difference: “Boko Haram Kidnaps Girls” and “Abubakar Shekau kidnaps Girls”. This point is so critical that it effectively invalidates any inference Chalabi makes beyond it. Never-the-less, let’s trudge ahead: The rapid acceleration in kidnappings in the past decade is obvious, but deciphering regional patterns is harder — especially for those of you who, like me, don’t have a detailed knowledge of Nigerian geography. Someone who has never touched GDELT before can be forgiven for thinking this. Chalabi cannot: Last time Chalabi used GDELT, she was kindly informed by “the helpful people at GDELT” (read: Kalev Leetaru) that GDELT grows exponentially. Let me say that again slowly: GDELT. GROWS. EXPONENTIALLY. It says so right in the Paper! This means that anytime you see growth in any phenomena as recorded by GDELT, it’s probably a product of the fact that GDELT itself grew during that time. I see no evidence of normalization to account for this fact. She’s made this error twice now, and the second time after she was explicitly corrected. Accepting this and moving on… I looked at population data and calculated the number of kidnappings per 100,000 residents in each of the country’s 37 states. This is a somewhat crude calculation. No kidding. We’re counting all geolocated kidnappings in the GDELT database since 1982 and dividing that by each state’s current population. Official kidnapping statistics for Nigeria aren’t available, and our numbers do provide a good relative picture; we can see where kidnappings in Nigeria are most prevalent. Just a reminder that this is incorrect: She’s counting reports of kidnapping, many of which will have significant overlap with one another. The kidnapping rate is the highest — 120 for every 100,000 people — in the Federal Capital Territory, which includes Abuja, Nigeria’s capital. I absolutely don’t believe this, and you shouldn’t either. I don’t know that this is the case, but it looks a lot like that Federal Capital Territory sits very near the centroid of the nation. If that’s the case, it is the default location for reports in Nigeria without more specific geolocation information. Reports of Kidnappings attributed to this spot almost certainly happened elsewhere within Nigeria, but there’s no way to know where. Three states in the south also stand out (Rivers, Delta and Bayelsa) with unusually high numbers of kidnappings relative to their population size. One possible explanation is the region’s oil wealth, otherwise known as the curse of the black gold. The United Nations news service has also highlighted how oil extraction in the south of Nigeria has been accompanied by violence and criminality. Great! Let’s invoke some controversial literature in developmental economics because it seems to fit our narrative. Alternatively, it could just be because crime is more prevalent in urban areas. Better yet, reporting coverage is better in urban areas. A kidnapping of one girl in the north might be covered by the local rag (if at all). A kidnapping one of the major urban centers might get picked up by the biggest regional or national paper, from which everyone else will begin reporting the same. One other state that was well above the median kidnapping rate (of five kidnappings per 100,000 people) was Borno in the northeast. That’s where the militant group Boko Haram, which is responsible for the recent mass kidnapping, is based. When we filtered the results to look at the first four months of 2014, Borno had the highest kidnapping rate in Nigeria. That couldn’t possibly have anything to do with the thousands of international news outlets covering the single kidnapping of 300 girls, could it? Girls were taken from the Government Girls Secondary School in Chibok, which is in Borno state. At the substate level, Chibok has rapidly become more vulnerable to kidnappings; GDELT has recorded 649 kidnappings there in the first four months of this year. GDELT recorded one in 2013, and none before that. There’s no support for this conclusion. The very fact that the “trend” she describes moves from 0 to 1 to 649 over the course of three years only suggests that there’s some fundamental change in the coverage between these years. The 649 number also comes from left-field: I’ve heard reports placing the number of kidnapped at around 300. So Chalabi’s assumption is that, in addition to the 300 all the news is about, there are around 350 other people kidnapped from Borno alone. Wrong! There were 649 reports of kidnappings in her dataset! We can make absolutely no inference about the number of kidnappings. The error Chalabi made in the first sentence has infected every claim she makes in the piece. I wasn’t the only one annoyed. Twitter was enragedErin Simpson in particular had a lot to say. She even put a bounty on a storified version of her rant, which I sadly lost by about five minutes. ### What's the deal with Kansas? Dylan Matthews over at Vox recently issued a slight correction to an article of his about Pornhub’s look at the politics of its consumers (SFW, believe it or not). Pornhub’s analysis suggested that Porn viewership in Kansas dominated that of all other states at 194 pageviews per capita. This was a conspicuous outlier, as it bucked the trend that Democratic-leaning states tend to watch more porn than Republican-leaning states. John Bieler bumped a similar issue with protests. Either there’s a field in the middle of Nowhere, Kansas where people go to protest and watch lots of porn, or there’s a bug in the analysis. When things like this happen in social science, it’s usually because of an error. Such things are always worth investigating. If it is an error, then you can fix it and strengthen your analysis. If it’s not an error, they you might have discovered something truly interesting. In this case, it was certainly an error. Which brings us back to Matthews at Vox and the correction. Pornhub analyzed IP-based geolocations. In other words, it recorded the IP address of everyone who visited, and then ran the IP address through a geocoder. Want to know where you are? IP-based geolocation is a type of coarse geolocation–it’s not super-accurate. Your phone can’t use it to figure out street directions, which is why you have a GPS chip. GPS is called fine geolocation. There are a bunch of IP addresses that are assigned to the United States. Some of those have no regional mapping, meaning that anyone using them could be anywhere in the lower 48, so far as the server is concerned. The server just records the IP and moves on. Then, when pornhub wants to run its analysis, it passes that list of IPs to the geocoder. When the geocoder gets one of these IPs that isn’t anywhere more specific than “United States”, it will return the centroid of the country. Where do you think the centroid of the US is? Now, attributing all this wierd stuff to Kansas isn’t the Geocoder’s fault. Geocoding a centroid in the absence of more information is a good strategy! Imagine you have to predict a person’s height, but you know nothing about that person except which country they’re from. Let’s say we know they’re from the United States. Well, what’s the average height of a person in the United States? Wikipedia breaks it down by gender, which makes sense: most adult men are a little taller than most adult women. But we don’t know the person’s gender. We could take the median of the average height, or we could build a population-weighted mean. But all of these solutions boil down to the same idea: take an average. The centroid is the average of all coordinates in a country’s physical boundaries. So it’s arguably the best guess we can provide about an event’s location in the absence of more information. In a way, all geocoded addresses are centroids: it takes the most specific geographic identifier it has (Country, State, County, City) and returns the centroid of that geographic region. If you want to discuss a map with geocoded data on it, you better be certain you know about that! Otherwise, you’ll end up correcting your articles after publication. If you’re working with international geocoded data, Frank at Gothos has made your job easy by puting together a handy dataset with estimates for the centroids of every country. Go forth, and be wary of claims that Kansas is special! ### Event Data in Popular Media Mona Chalabi of 538’s Datalab took a stab at reporting on the mass kidnappings executed by Boko Haram in Nigeria. While this is a very popular story at the moment, Datalab, 538, and the larger collection of trendy new news entities run by geeks offer a data-driven approach to news and analysis. Now kidnappings, like most things in social science, are exceedingly difficult to find or collect data for. The modern, popular approach to dealing with things which are hard to measure is to use event data. The idea behind event data is to parse the contents of large quantities of news reports and use machine learning to turn that article’s text into a bunch of raw data. The most available source of event data at the moment is called the Global Database of Events, Locations, and Tone, or GDELT. So it’s only natural that Chalabi should refer to GDELT to address this topic. Chalabi’s argument is that Kidnappings are on the rise in Nigeria. Chalabi offers this graph: Anyone who’s actively worked with GDELT at any reasonable length (or read its foundational ISA Paper) will recognize the exponential nature of the growth in GDELT entries. This occurs because GDELT grows exponentially over time. Jay Ulfelder observed this error on Twitter: Luckily, someone (probably Kalev Leetaru himself) reached out to Chalabi to try to correct the course of the evidence. You rarely want to plot raw numbers of events, since global news volume available in digital form has increased exponentially over the past 30 years (it increases by a fraction of a percent each day), meaning the total universe of events recorded in the news and available for processing has increased exponentially over that time. Instead, you want to divide the total number of what you are after (kidnappings in Nigeria) by all GDELT events for a given time period to get a normalized intensity measure. Chalabi summarily made the modification, leading to this graphic: Jay Ulfelder still wasn’t convinced: GDELT is a notoriously noisy source of data. While I agree (with Ulfelder) that it’s a pretty tall claim to make that kidnappings are actually on the uptick, I’d like to emphatically add that it isn’t necessarily incorrect. GDELT is a powerful tool, and it may lead to some breakthroughs in the social sciences (its legal and academic challenges notwithstanding). However, it’s critically important to remember that GDELT is powerful for its ability to show trends in broad strokes (months, not days) over long periods (decades, not years). Diluting the counts over time in this way diminishes the effects a variety of errors, including the one Ulfelder highlighter. I hope Chalabi can take the lesson of this forward for future reporting using GDELT. ### The Best Statistical Programming Language is …Javascript? R-Bloggers has recently been buzzing about Julia, the new kid on the statistical programming block. Julia, however, is hardly the sole contender for the market of R defectors, with Clojurefork library Incanter generating buzz as well. Even with these two making noise, I think there’s a huge point that everyone is missing, and it’s front-and-center on the Julia homepage: Julia Python Matlab Octave R JavaScript fib 3f670da0 2.7.1 R2011a 3.4 2.14.2 V8 3.6.6.11 1.97 31.47 1336.37 2383.80 225.23 1.55 1.44 16.50 815.19 6454.50 337.52 2.17 1.49 55.84 132.71 3127.50 713.77 4.11 5.55 31.15 65.44 824.68 156.68 5.67 0.74 18.03 1.08 328.33 164.69 0.75 3.37 39.34 11.64 54.54 22.07 8.12 1.00 1.18 0.70 1.65 8.64 41.79 Julia kicks ass on the benchmarks, but it also has a severe uphill battle. It’s new, it’s Linux, it’s command-line-only, and it doesn’t have support for the wide array of Statistical functionality available in R. But besides the obvious my-language-can-beat-up-your-language comparisons, notice anything interesting? Think orders of magnitude? Javascript performs nearly as well as Julia down the board. This nearly floored me. ### Measuring Global Population using System Dynamics For my zombie population dynamics paper last fall, I decided to induce the global effects of an outbreak of zombiism by starting with an empirically accurate global population model. Looking back, that baseline model is a really cool way to demonstrate system dynamic modeling. System dynamic modeling is a process by which we define the causal relationships between variables using logical and mathematical functions, and then use those in turn to forecast future effects. This sort of modeling works very well with anything that can be defined in terms of population counts. The canonical example is the Lotka-Volterra predator-prey model. In this model, a predator population thrives on a population of prey. As one population grows, the other shrinks, and vice versa. The global population model is a bit simpler since we’re dealing with a single population (that is, humans). We’ll call this a gain-loss model. Gains are defined by births, and losses by deaths. After all, when modeling the entirety of humanity, there are only these two ways to change the global population (i.e. humanity’s total population count is unaffected by migration, divine ascension, etc…). Let’s take a look at that visually: The Box at the center represents the entire human population. The “flow” to the left of the box indicates births increasing the number of humans in the population, and the one to the right is deaths decreasing the total population. The two “converters” (circles at the top left and right) define the rates at which births and deaths occur. This is a cool setup because we can define any of these as numerical values or as functions. For example: $$Births=GlobalPop*BirthRate$$ and likewise with the deaths. Thus, $$GlobalPop_{n+1}=GlobalPop_n(1+BirthRate_n-DeathRate_n)$$ Now, in order to define these numerical values, we must consult some statistics. For this, there is the World Development Indicators. I parsed the very same into separate datasets not so long ago, but this time it’ll be easier to just use the graphs. For that, Google has set up a really nifty Public Data Explorer. To start with, let’s look at Birthrate. The graph looks pretty nonlinear, the take home point is that it’s trending downwards. As such, we can represent it linearly with the knowledge that this may be something to revisit later on the the model construction. Over the 49 year period from 1960-2009, the value drops from 4.91 to 2.52. That’s a change of 2.39, or a mean change of .049 annually. But what does this number actually mean? “The average number of births per woman.” Crap. This creates two new problems. The first is pretty clear: We will know how many people there are at any given time, but not how many women. We’ll have to include another converter to capture this fact. The second is much more subtle: this is evaluated over steps of time (in this case, years). But the rate is measured over a woman’s lifespan. So we also must include a variable which represents the mean female lifespan. All told, here’s our model: where: $$MeanLifespan=54.6+.345n$$ $$Birthrate=(4.91-.049n)/MeanLifespan$$ Where the numerical values are derived from time series regressions on the data for female life expectancy and fertility rate. As for the percent of the population which is female, it looks to be roughly stagnant around 49.7, so a fixed value should be good enough. Now, all that remains is to define the Deathrate and test the model! Deathrate presents a new problem–a truly nonlinear relationship. As such, we’ll divide it by 1000 to give us a per capita probability of death per year, and fit the prediction curve using exponential regression, giving us this equation: $$Deathrate=.018907*n^{-0.21}$$ This curve actually gives a strikingly good fit to the data at hand: With all of our values set and relationships defined, its high time we tested this puppy out! We’ll set the initial population equal to the population in 1960, approximately 3 billion. We should see population growth up to today’s estimate of 6.775 billion. Let’s check it out, shall we? Not too shabby. Hardly perfect, but considering everything that can go wrong, I think we did pretty OK! Two dangling observations: First, we started out modeling a phenomena which is strikingly linear. Second, our prediction is slightly nonlinear. Slightly, but conspicuously. System Dynamic modeling is excellent at representing nonlinear phenomena, so it’s a little gratuitous to use it in a linear case. Never the less, our model preforms very well: We can see here the R-squared of .98, indicating a high degree of accuracy. The Beta is pretty good as well–if it were perfect, we would see a beta of exactly 1. However, if we compare this to a linear regression on the time series data, we look a little worse: Not so good. Oh well. There are a few sources of this error: 1. The life expectancy is latent–it describes the life expectancy of people born in that year, not the life expectancy of people living in that year. 2. We didn’t describe the female percentage of the population with a function, which makes it entirely inflexible. 3. The pseudo-linear models we use to generate many of the coefficients are imperfect (though admittedly very good in this case). 4. The data are imperfect: the World Development Indicators are calculated using a completely different methodology, which one hopes would be fairly accurate. Even so, it would be logistically impossible to count every single person on the planet at a fixed point in time–people are born and die every minute. The WDIs are correct to the ballpark (as are the models we devise from them), but only so. Even so, I claim we’ve accomplished something moderately cool. We’ve set up a complex equation set which predicts the global population, controlling for life expectancy, fertility, proportion of the population which is female, and per capita deathrate. Our model could survive an exogenous shock to any of these, while the linear model is left scratching its head. System Dynamic modeling is cool because it is necessarily representative of the causal linkages between variables, thereby creating much more compelling forecasts. This model hinges upon the irrefutable causality between births, deaths, and population size. Maybe next time we can try something a little more debatable, and shed some light on something new. ### Measuring the EIU Democracy Index (with Polity IV) Yet again, I have conjured up an (academically) unusual dataset on democracy! This time it’s the Economist Intelligence Unit’s Democracy Index, a weird little gem. The dataset is the basis for a paper the Economist publishes every two years. Because of this biannuality, there is data estimating the “Democratic-ness” of the world’s countries for 2006, 2008 and 2010. What happened between those years, God only knows. I dumped the data into a CSV file and merged them with the polity data. Since they’re both measures of democracy, I hypothesize (again) that they should be fairly linearly correlated. Shall we take a peek? EIUmergePolity <- read.csv("http://nortalktoowise.com/wp-content/uploads/2011/07/EIUmergePolity.csv") attach(EIUmergePolity) model1 <- lm(polity2 ~ Overall) summary(model1) Call: lm(formula = polity2 ~ Overall) Residuals: Min 1Q Median 3Q Max -9.9426 -2.4366 0.1764 2.1389 10.5728 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -9.39327 0.42154 -22.28 <2e-16 *** Overall 2.41504 0.07162 33.72 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.475 on 470 degrees of freedom (35 observations deleted due to missingness) Multiple R-squared: 0.7076, Adjusted R-squared: 0.7069 F-statistic: 1137 on 1 and 470 DF, p-value: < 2.2e-16 Yet again, we find a good linear model which just isn’t all that impressive. An R-squared of .7 is not half bad for an explanatory model, but these are two different measures of the same thing. They should be more closely correlated than that. Let’s take a look, shall we? plot(Overall, polity2, main="Democracy Index over Polity2 Score") abline(model1) Damn! It looks kind of non-linear. Again. Taking the same technique I used to fit a quadratic curve to the data last time, maybe we’ll get closer. model2 <- lm(polity2 ~ Overall + I(Overall^2)) summary(model2) Call: lm(formula = polity2 ~ Overall + I(Overall^2)) Residuals: Min 1Q Median 3Q Max -10.9995 -1.3065 0.3161 1.5667 11.9635 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -14.51086 0.88953 -16.313 < 2e-16 *** Overall 4.70622 0.36131 13.026 < 2e-16 *** I(Overall^2) -0.21243 0.03289 -6.459 2.64e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.334 on 469 degrees of freedom (35 observations deleted due to missingness) Multiple R-squared: 0.7314, Adjusted R-squared: 0.7303 F-statistic: 638.7 on 2 and 469 DF, p-value: < 2.2e-16 Doesn’t help much, I’m sad to say. We see a small increase in the model’s descriptive value, but it’s really small. If we draw the curve to the scatterplot, will it at least look better? plot(Overall, polity2, main="Democracy Index over Polity2 Score") x <- seq(0,10) y <- model2$coef %*% rbind(1,x,x^2)
lines(x,y,lwd=2)

Yeah, it looks a little better, but there’s not much that can be done.  It seems that there’s a fundamental inconsistency in the way we measure governance, particularly at the extreme ends of the spectrum.  Perhaps this is characteristic of data measuring the same thing in different ways.  Anyone know any literature examining this sort of thing?

### More fun with the Failed States Index (and the State Fragility Index)

So the other day’s experiment with the Failed States Index and the Polity Data didn’t yield the linear trend I had originally expected.  After all, the two measure fundamentally distinct things.  But perhaps there’s another dataset which will match linearly.  The same people who made polity also put out a dataset called the State Fragility Index.  Perhaps their definition of Fragility will be similar to the Fund for Peace’s definition of Failure.  I merged the 2009 data to take a peek.

attach(SFImergeFSI)

OK!  Let’s start this off with a good old-fashioned Linear Model.  After all, if these two are measuring the same thing (and they’re both doing it well), they should be very linearly correlated.

model1 <- lm (sfi ~ Total)
summary(model1)
Call:
lm(formula = sfi ~ Total)

Residuals:
Min      1Q  Median      3Q     Max
-6.6865 -2.4735 -0.2238  2.3175  6.7389

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.31988    0.79689  -10.44   <2e-16 ***
Total        0.23536    0.01039   22.65   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.118 on 159 degrees of freedom
Multiple R-squared: 0.7634,	Adjusted R-squared: 0.7619
F-statistic: 513.1 on 1 and 159 DF,  p-value: < 2.2e-16

Alright, that’s not so bad.  The R-squared is smaller than expected, but we’ll talk about that in a minute.  Also, there’s a small beta of .23, but that’s largely a byproduct of the differing magnitudes of the scales.  We could normalize that to just about anything by multiplying the State Fragility Index.  If we cared enough, we could jump straight to an approximation of one (which would be sort of ideal) by making the magnitudes of the scales match.  State Fragility Index measures from zero to 25, and the Failed States Index measures from zero to 120, but has not yet spanned either of its extremes (and probably never will) .  So, if I cared about normalization, I might multiply the State Fragility score by (120/25).  Better yet, I’d subtract out the min of the Failed States Index from each value of the FSI, and then multiply the State Fragility score by the Ratio of max(Failed States Index)/max(State Fragility Index).  But I don’t care that much, because doing stuff like that doesn’t change the topology of the data.  If it does, then you’ve done something horribly wrong.

Getting back to the data, let’s have a look at the scatterplot.

plot(Total, sfi, main="State Fragility Index over Failed States Index")
abline(model1)

Huh.  The Model’s OK in numbers, but the scatterplot tells us a great deal more.  It doesn’t actually look linear, frankly.  I can think of two possible explanations: one, there’s something weird about the way the State Fragility Index assigns values less than 3; or two, it’s just not linear.  Let’s look at the nonlinearity possibility.

To do this, we’ll make another linear model, only this one will be multivariate.  Let’s explain the State Fragility Index using the Failed States Index, and the square of the Failed States Index.  This is cool because the computer can figure it out like a normal linear model of three variables, and then we can use the coefficients to graph the quadratic function.

model2 <- lm(sfi ~ Total + I(Total^2))
summary(model2)
Call:
lm(formula = sfi ~ Total + I(Total^2))

Residuals:
Min      1Q  Median      3Q     Max
-7.1186 -1.6911 -0.1213  1.4091  8.2868

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.9030666  1.4915584   0.605   0.5457
Total       -0.0938886  0.0479218  -1.959   0.0518 .
I(Total^2)   0.0025158  0.0003595   6.998 6.97e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.733 on 158 degrees of freedom
Multiple R-squared: 0.8194,	Adjusted R-squared: 0.8171
F-statistic: 358.4 on 2 and 158 DF,  p-value: < 2.2e-16

Our R-squared has improved a bit, which means this is definitely a better fit.  The coefficients may look a little whacky, but the make sense when you graph them.  By the way, we can’t graph this with abline(), since it’s got mulitple variables, so we have a little hack for that, picking points uniformly across the x axis and binding them with their appropriate y values.

plot(Total, sfi, main="State Fragility Index over Failed States Index")
x <- seq(min(Total),max(Total))
y <- model2$coef %*% rbind(1,x,x^2) lines(x,y,lwd=2) Well, that certainly looks a lot better. It also lacks a strong theoretical basis. Why in the world should the Failed States Index define its distribution linearly and the State Fragility Index define it quadratically? Makes no sense to me. That’s why I’m inclined to like the other explanation better: perhaps there’s a methodological thing (technical term) which causes the State Fragility Index to assign values under three differently than all the other values. To test this, let’s break up the dataset. sub1 <- subset(SFImergeFSI, sfi>3) And now we should be able to build our linear model! model3 <- lm(sub1$sfi ~ sub1$Total) summary(model3) Call: lm(formula = sub1$sfi ~ sub1$Total) Residuals: Min 1Q Median 3Q Max -7.3079 -2.4196 0.0314 2.4322 7.9236 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -13.88426 1.90157 -7.301 3.86e-11 *** sub1$Total    0.30152    0.02216  13.607  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.108 on 116 degrees of freedom
Multiple R-squared: 0.6148,	Adjusted R-squared: 0.6115
F-statistic: 185.1 on 1 and 116 DF,  p-value: < 2.2e-16

This r-squared is 6ish.  But the r-squared of our first model was .75ish.  We select the subset of the data that looked linear, run our regression, and the model fits less well?  What a rip-off!  Guess that means I should stick to the non-linear interpretation, but god only knows what that means in practical terms.  I’ll let that remain a mystery for today.

plot (sub1$Total, sub1$sfi)
abline(model3)

Well, that was frustrating and not especially informative.  We now know that the Failed States Index is correlated with the State Fragility Index somehow.  That “somehow” reinforces my belief that the Failed States Index Data is totally worth its salt for quantitative analysis, and just waiting for someone much smarter than me to come by and do some analysis.  I’m looking at you, R-bloggers.

### Analyzing the Failed States Index (with Polity IV)

So, I decided to sit down and have a little fun with that Failed States Index data I put together. To start, I expect that the dataset will be pretty linearly correlated with the polity IV data. This makes sense–true democracies aren’t failed states, and failed states tend not to be democratic. To test this, I merged the two datasets for 2010. Let’s load them into R…

attach(FailedStateIndexPolity.Merge)

And take a look at a scatterplot of the two variables, Total (which is the sum total of the failed state indicators) and polity2 (which is a country’s democracy score minus its autocracy score):

plot(polity2, Total)

Whoa, wasn’t expecting that. It seems that most failed states have polity scores closer to zero than -10, and the more extremely democratic or autocratic a state is, the lower its failed state score will be. On second thought, maybe this does make sense: strong democracies won’t be failed states, and strong autocracies will supplant economic well-being and personal freedoms with strong state control, which failed states should lack the ability to exert. If that’s true, we should look at the relationship between the failed state’s total and the polity’s democ and autoc scores.

sub <- subset(FailedStateIndexPolity.Merge, democ>-60)
plot(sub$democ, sub$Total)

OK, this makes a little more sense.  More democratic states definitely look a lot less like failed states.  If my hypothesis about strong state control holds, we should see almost the same relationship for the autocracy score.

sub <- subset(FailedStateIndexPolity.Merge, autoc>-60)
plot(sub$autoc, sub$Total)

Huh, that’s cool.  The trend is similar, but we have a ton of observations at zero.  In other words, of the states without an autocratic bone in their borders, there exists the full array of state failure levels.

Holistically, this supports a conclusion I never really came to in my prior analysis of regime metrics–namely, that state failure isn’t simply the absence of a democracy.  It’s the ineffectiveness of any government structure, democratic or authoritarian.  Of course, paramount in this conclusion is the understanding that democracy and authoritarianism aren’t linearly exclusive.

sub <- subset(FailedStateIndexPolity.Merge, autoc>-60 & democ>-60)
plot(jitter(sub$democ), jitter(sub$autoc)
abline(0,1)

Because the states whose polity scores are close to zero, and polity is the difference between democ and autoc, the points closest to the line (which represents democ – autoc = 0) are the most failed states.

### Wargaming The Battle of Prokhorovka

I took a class this semester called “Wargaming.”  It was a project course in which teams of five were instructed to devise an agent-based model which reenacted an historical battle.  The teams were all split between Computer Science undergrads and International Affairs grads, in hopes of training us for the kind of cross-talk between the disciplines which would be expected of us in a professional setting.  My group selected the Battle of Prokhorovka, a tank battle within the larger battle of Kursk.  This quickly attracted my professor’s interest, since he had never had a group attempt a tank battle before.  (He had always favored pre-modern battles for their comparative simplicity.)

The idea driving this course is that by constructing models which reenact battles with relative accuracy according to the historical record, we can then alter variables to determine what errors were made by the commanders.  For Prokhorovka, the Germans were ordered to retreat before the outcome of the battle was established, since the Northern Front of Kursk had faltered and the Panzerkorps were needed to reinforce German support of Italian defenses following the Allied landing in Sicily.  For this reason, the battle is considered to be a German loss, though a number of historians have claimed that the Germans could have won the battle.

In addition to the withdrawal, the Germans held the XXIV Panzerkorps in reserve for strategic reasions.  If the Germans couldn’t guarantee a victory by remaining in the battle, perhaps they could solidify that victory by putting the XXIV into play.  My team built the model with these two variables (withdrawal and XXIV) as the elements which could be controlled between runs.  Here’s the model. [I apologize, the interface was designed by one of the CS undergrads who used an absurdly high resolution, so it’s a pretty big Java applet.]

As it runs by default, the Germans retreat.  This is exactly what happened in reality, but the force depletions (listed by the graphs in the bottom right) are totally wrong.  The Soviets followed the Germans after the retreat, overextending their defenses by taking the offensive, and taking massive casualties because of it.  So we admit that the model is inherently flawed, but the mantra of the course (and the modeling community at large, it seems) was “All Models are Wrong.”  Even so, I would have liked to start with a slightly more accurate model (forthcoming).

Setting aside this enormous methodological flaw, let’s continue playing the game.  If we run the model again with the German retreat turned off, the Germans pretty reliably take Prokhorovka, and often then with no remaining tanks (if you turn “Deathmatch” on and hit run again, you’ll see this behavior after the German footsoldiers take the town.  So it seems the inclusion of the XXIV is entirely superfluous.  Or is it?

Modeling in this manner is, frankly, a little silly.  I could rewrite the source of the model (seen at the bottom of the model’s page) so that the Germans are slower (which would more closely match historical record).  And I could add more Soviet troops, also probably more accurate.  And I could try to rewrite the damage model so that tank guns are seen to be more effective at longer ranges, but Soviet tanks execute the kamikaze collision attacks the Soviets actually did.  Or we could do exactly the opposite.

In order to get the German force depletion rates down to levels which approximately matched the historical records, we fortified the German forces with a Deathstar.  Yes, that deathstar.

[If you want to try it out, the variable switch is in the top right corner of the applet.]

Because this model holds little ground in the historical record, its results are entirely unsound.  Were the model a little more faithful to the force depletion mechanics of the German escape, I might be more confident in my group’s work.  That said, I learned a lot from this project.

As a business matter, I now know how to write ABMs in Netlogo.  This is not an especially difficult language to master, especially for someone with a reasonable knowledge of programming control structures and boolean logic.

More importantly, I learned some of the limitations of ABMs.  As abstract models, ABMs can help us understand incredibly complex behaviors with very simple rules, a la John Conway’s Game of Life.  In terms of International Relations, this can be a valuable theory-building tool (for example, see Cederman 2003).  However, an attempt to rebuild a complex real-life event (like a specific battle, for instance) is on less solid methodological grounds (though I can’t think of many other approaches which would be appropriate for this sort of study).

I’ll be playing with ABMs this summer, between my class (GIS), internship (OPAR), job and other projects.  We’ll see if I can’t build a more cohesive approach to Prokhorovka, and ultimately some other more theoretical work (like a replication of Cederman’s model, though that may be down the road a bit).

For posterity, the most important thing I learned from the class: All models are wrong.

### The Monty Hall Problem and Optimal Guessing Technique on the GRE

I was teaching a GRE class today when I was struck by a sudden inspiration: the quantitative comparison problems are basically Monty Hall Problems.

For those who don’t know, Monty Hall was the host of the TV game show Let’s Make a Deal.  In the show, a contestant (you) were offered three distinct doors.  One concealed a new car.  The other two hid goats.  You would select a door, and then Hall would pick one of the other two doors, and open it.  It would invariably contain a goat, thereby leaving two doors, behind one of which awaited the hallowed car.  He then offers you a choice: you could keep your selected door, or you could swap for the remaining door.

The problem is calculating the optimal strategy.  The intuitive approach is to assume that your winning probability is uniform across your choices, and therefore it wouldn’t really matter what you did–you’d win 50% of the time.

This is entirely incorrect.

At the moment you selected your door, your probability of winning with that door was 1/3, or about 33%.  Everyone can generally agree on this point.  The problem is that when Hall eliminates one of the wrong doors, your probability of being correct doesn’t change.  In other words, the probability of winning with your door remains 33%, but the probability of winning with the other door (i.e. by switching) becomes 2/3 or about 67%.  Therefore, the only rational strategy is to switch every time.  (A mixed strategy only wins with probability 5/9).

So, what does this have to do with guessing on the GRE?

Quantitative Comparisons or “Quant Comps” all follow basically this format:

x + 2y > 8
Column A                          Column B
2x + 4y                              20

A) The quantity in Column A is larger.
B) The quantity in Column B is larger.
C) The two quantities are equal.
D) The relationship cannot be determined.

In writing the answer choices for these problems, the ETS left out one critical piece of information: A, B, or C must be true for all values of x in order to be correct.  In other words, if we use every possible value of x and y, we will uniformly generate the answer every time we try.  If we can pick any two pairs of numbers such that they lead to different conclusions, then the answer must be D.  However, this can be a tall order.

The trick to approaching these when they involve Algebra is to select a set of numbers which satisfy the conditions given, and then to eliminate answer choices based on the numerical comparison of the two columns.  For example, I could say for this one that x = 3 and y = 7.  Thus the inequality is

(3) + 2(7) > 8

which is verifiably true.  Because these numbers are valid, we can calculate the value in Column A, which is

2(3) + 4(7) = 34

Since 34 is greater than 20, we can be certain that of the answer choices, B or C cannot possibly be correct.  We have not, however, demonstrated that A is the correct answer.  It could still be D.

When we started the problem, there was a uniform distribution of probability across the answer choices.  We know this because it is hardwired into the test to be so.  We implicitly selected D as our answer (just as we selected a door above), and then by eliminated 2/3 of the remaining answers (just as Monty Hall did with the goat door).  The probability of D being correct is still 25%, but the probability of A being the correct answer is now 75%.  Armed with only this information, it would be easy to guess your way to a statistically better score.  Go figure.

PS) The answer to this one in particular is D.  Sorry if you guessed!

[I cross-posted this at the xkcd forum to stir up some controversy.]

### An Assessment of the Utility of Sanctions

[This is a post of a Paper I wrote for INTA 6103: International Security]

The modern use of economic sanctions in place of warfare is a hotly debated policy.  While sanctions may be used as an alternative to warfare, the question of whether they are effective (and therefore whether they are an appropriate policy maneuver) is an open question.  On one side, proponents of sanctions have made empirical arguments indicating that sanctions do achieve the desired effects with reasonable frequency.  Their opposition applies a carefully reasoned, yet ultimately flawed, reconsideration of the definitions, operationalizations, and empirical evidence to conclude sanctions are not effective.

Robert Pape presents one such refutation to studies suggesting the efficacy of sanctions[1].  He opens his discussion on the matter with the assertion,

The decisive question I ask is whether economic sanctions are an effective tool for achieving international political goals, and if so, under what conditions. I do not address the broader question of whether sanctions are an effective substitute for war (which would require an analysis of the relative utility of both force and sanctions for achieving political ends.

I find this approach to be questionable.  Consider sanctions to be a screwdriver (in comparison to warfare, a hammer)—were we to approach the study of the utility of the screwdriver using Pape’s method, we would place the hammer in an opaque box and observe the screwdriver.  This approach to measuring the utility of tools is both comical and ineffectual.  To look at a screwdriver and claim it is not effective is meaningless: we must justify measure the effectiveness of the screwdriver relative to another tool.  Pape’s study is thus an exercise in futility out of the gate.

Looking past Pape’s methodological oversight, he provides a rigorous and agreeable definition of sanctions, up to a single point.  He distinguishes between economic warfare and economic sanctions.  This serves to distance his approach to defining sanctions from that of the study he seeks to debunk.  By defining the scope of sanctions differently than the original study, he achieves a different conclusion.  This should not be surprising, and yet Pape believes it is fundamental enough to trumpet the superiority of his conclusion.  To his credit, Pape may have been justified in doing this if he could provide a more appealing theoretical support for his interpretation of sanctions.  Unfortunately, Pape fails to do this.  His theory for the exclusion of acts of economic warfare from the classification of sanctions is

Economic warfare, such as naval blockades, conducted purely for the purpose of weakening an opponent’s military capability in an arms race or ongoing war does not qualify as an economic sanction because it is merely a component of a coercive strategy centered on force, not an independent coercive strategy in its own right. This does not exclude the possibility that economic sanctions could be employed in war, but the expected coercive mechanism would have to be that the target state would make concessions because of its unwillingness to bear the economic pain, not because its economic weakness would lead to military defeat.

On this I must beg to differ.  The act of weakening an adversary’s military does not occur in a vacuum.  If trade of arms is interrupted, the economy which ordered the arms may take a hit.  Whether or not the nation is presently involved in an arms race or armed conflict is not consequential to this effect.  Interrupting markets will harm the economy.  Thus excluding acts of economic warfare (which Pape would title economic sanctions at any other time) is merely an arbitrary selection of cases which Pape wishes to excise from validity.  Fortunately for him, it seems these cases represent a very small sample (N=1, so far as I can tell) of the instances of sanctions he analyzes[2].

For the majority of cases Pape leverages as evidence in support of his thesis, I do not have any criticism.  Without a much more careful analysis of the historical record, I could only cavil more about Pape’s definitions.  Instead I will offer that I find his argument to be modestly compelling: accepting all cases of concessions amidst sanctions is too liberal an approach to generate a meaningful estimate of the absolute effectiveness of sanctions.  With this in mind, I would conclude that Pape’s estimate for the proportion of cases which are valid examples of effective sanctions is a reasonable lower bound, and conversely the estimate of the study he refutes an upper bound.  This is an appreciably wide range, but without more data and a larger set of the particular variety of cases in question, definitive conclusions concerning the efficacy of sanctions will remain elusive.

With this range, Pape has demonstrated at the very least that sanctions are not effective a vast majority of the time, irrespective of the relative effectiveness of armed conflict.  However, without an alternative policy tool to analyze against, we are left with the uncomfortable decision of whether to advocate the use of sanctions or not.  In the absence of other coercive policy tools against which we may compare sanctions, this is a fool’s advocation.  Pape makes further inference based on the small sample of cases he accepts as effective that three hypotheses may elucidate the effectiveness of sanctions.  First, they should not be applied unless the issue at stake is considerable in magnitude.  Second, they are not the humane alternative to warfare which many make them out to be.  Third, they may be applied in conjunction with force to generate results.

This conclusion has not deterred further research in the field, however.  Hovi, Huseby, and Sprinz turn to formal modeling.  Instead of examining the empirical data, they indentify a disconnect between the threat of sanctions and the actual use of sanctions.  They do this to suggest that the threat of sanctions is insufficient in some cases, thereby explaining the existence of cases in which sanctions are put into effect, and target of the sanctions makes the desired concession[3].  In this way the researchers are defining the conditions under which it is rational for a nation to concede to the threat of sanctions, to concede to the institution of sanctions, or to endure the consequences of sanctions indefinitely.

If we accept this finding, we can infer something far more novel than Pape does: namely, that instituted sanctions are a selected (and therefore biased) subset of threats of sanctions.  Sadly, Hovi, Huseby, and Sprinz do not extend their research to this length, but the potential exists for an empirical analysis which could supplant the usefulness of previous empirical discussions on the matter.

Besides the canonical assumption of actor rationality in the model, Hovi, Huseby, and Sprinz assume that there exists a meaningful distinction between lenient and potent sanctions, if sanctions should be implemented.  They present this assumption with two justifications: First, it allows for calculation of threat credibility (implicitly as somehow inversely proportional to threat magnitude).  Second, it allows a target to concede after the sanctions are in place.  As a methodological choice, I do not see the advantage to building the model in this way as opposed to simply defining a single sanctions variable which is continuous between absolute embargo and no sanctions[4].  Neither of their justifications adequately address this issue.

Again setting aside quarrels on the construction of the model, the equilibrium conclusions of the model are interesting.  Under perfect information, the imposition of sanctions is never in equilibrium for any set of conditions.  Because we live in a world in which sanctions do occur routinely, we can safely conclude that this finding is not reflective of reality.  However, in a system of uncertainty, there exist some conditions under which the imposition of sanctions is possible.  For this conclusion, we can potentially accept the validity of the model for policy guidance.  I would, however, treat this with some trepidation until an empirical analysis of sanctions threats versus sanctions could confirm the real-world usefulness of an equilibrium in which sanctions are installed.

If we are to derive policy advocations from this model, sanctions should be effective when the imposition of sanctions is judged to be unlikely by the target.  The optimal policy is to credibly threaten sanctions which exceed the target’s pain tolerance.  However, if this is not possible, effective sanctions require incredible threats which the sender intends to enact.  In other words, don’t bluff.  This is not necessarily an easy trick to pull off, hence an empirical prediction about effective sanctions should more closely match Pape’s than the upper bound.  On the flip side, potential targets of sanctions should simply consider all threats to be credible, and decide based on whether the losses given will exceed the benefit from yielding no concessions.

The application of sanctions is a hotly debated topic for very good reason: it could stem the effects of warfare.  Contrarily, it could exacerbate said effects.  However, neither of the papers reviewed here offer definitive enough evidence to make strong policy suggestions in either direction (more sanctions or less sanctions).  They do, however, offer the framework for designing the appropriate studies to tease out the information we ultimately seek.  Are sanctions effective relative to warfare?  Are they rational, given the probable costs of war?  Future research to address these questions may make fundamental changes to the way we approach coercive policy.

[1] Robert A. Pape “Why Economic Sanctions Do Not Work” International Security, Vol. 22, No. 2 (Autumn, 1997), pp. 90-136

[2] This is inferred from Pape’s own discussion of the empirical history.  Whether historical record corroborates this assertion is outside the scope of this discussion.

[3] Jon Hovi, Robert Huseby, Detlef Sprinz, “When Do (Imposed) Economic Sanctions Work?” World Politics, Volume 57, Number 4, July 2005, pp. 479-499

[4] Alternately, if the model were simplified as suggested above, it would not be far removed from a simple bargaining model.  Played out over an indefinite number of steps (N), an analyst may be able to reveal more on the nature of sanctions strategies, i.e. why not all successful sanctions yield concessions immediately.

### Terrorism, Deterrence by Denial, and Incentive Structures

For my Deterrence course, I was given an assignment to critically review an Essay of my choice concerning Deterrence.  I picked Emanuel Adler’s “Damned if you do, Damned if you don’t” [gated].  Adler advances the notion that modern deterrence thinking presents a detrimental dilemma: the decision to respond to aggressors (terrorists by conventional means or states by nuclear means) by either responding with force, or not.  Neither of these actions is particularly appealing: those who respond with force will play into the aggressor’s expectations, leading to military exploits.  The second option is worse: by declining to step up against aggressors, nations thereby invite more attacks.

These concerns are not unique, hearkening back to second-wave nuclear deterrence theory.  By rationalizing force and the threat of force, Adler constructs a virtual game in which no option given to the player (policymakers, in this case) is favorable.  Adler then proposes that a superior solution may exist by changing the parameters of the game, instead of the strategies.  This proposition resounds with the possibility, familiar to anyone with knowledge of game theory.  The only difference between a Prisoner’s Dilemma (i.e. a game of mutual defection) and a 2-person Stag hunt (i.e. a game of mutual cooperation) is the payoff amounts.  With the same mentality about terrorist acts (though the calculations are fundamentally asymmetric and therefore more complex), perhaps we can generate a solution to get out of the dilemma.

Defusing the Dilemma

Adler lays out a series of policy changes which, if implemented, should help nations break out of their dilemmas.  He calls this process “defusing by denial.”  In contrast to “deterrence by denial,” defusing emphasizes defensive measures, assuming the enemy already has the weaponry necessary to carry out an attack.  In short, defusing by denial calls for policymakers to:

1.      Harden likely targets to minimize provocations.  By increasing defenses of favorable terrorist targets, policymakers lower the probability of inciting carnage in an attack, and therefore make attempting such attacks less attractive options to terrorists.

2.      Improve civil defense measures (e.g. police forces and disaster-response teams).  Powerful civil defense apparatus should have a similar effect to structural defense, minimizing civilian damages following an attack.

3.      Raise the violence threshold required to trigger an armed response.

4.      Minimize civilian casualties during an armed response.  This stifles international outcry and helps preserve diplomatic relations.

5.      Avoid humiliating and shaming opponent.  This denies terrorists the sense of victimization on which they rely to attract support and rally for later attacks.

6.      Challenge the honesty of terrorist narratives with terrorist supporters.

These steps are guidelines for counterterrorist policy, but do not hold for interstate relations.  For defusing states, Adler proposes another style of defusing, which he terms “defusing by restructuration.”  Restructuration means more use of “soft power” and unconventional diplomacy to sway international relations in order to avoid the sub-optimal results of starting a catastrophic war or inciting repeated offenses.  Adler uses Iran as a hypothetical example: by offering Israel coverage under NATO’s nuclear umbrella, Iran may no longer feel a need for Nuclear weapons, since its primary target would possess too great a deterrent.

Why Defusing is Nonviable

While Adler’s suggestions are mostly logical, his approach to terrorism is fundamentally deterrence by denial by a different name.  Deterrence by Denial, as Adler would have us believe, is denying potential attackers the ability to attack by assuring weaponry does not fall into their hands.  This is a slight simplification, but one that leads Adler to the delusion that his suggestions are novel.  Deterrence by denial is preventing an enemy from successfully attacking (whether by preventing the enemy from obtaining a weapon, preventing the enemy from using a weapon they have obtained, or else preventing the weapon from yielding effective damage), thereby making an attempted attack a waste of effort.  For example, the U.S. missile defense shield (if it worked), would be a prime example of deterrence by denial: even if an enemy had nuclear weapons and ballistic delivery, no weapon would be able to penetrate the shield, thereby denying the enemy the opportunity to do damage with said weapon.  This reveals steps one and two to be merely common sense: weaken the weapon’s own ability to do harm, and strengthen the government’s ability to save lives in the wake of an attack, and you make attack attempts less appealing.  This is both logical and feasible: increasing spending on defense (steps one and two) should generally be a popular measure, and thus an easy sell to democratically-elected leaders.

Of the steps in defusing by denial, step three is the most puzzling to me.  By raising a nation’s tolerance for violence, it seems that Adler is encouraging attacks underneath that threshold.  Adler is not breaking out of the “damned if you do, damned if you don’t” dilemma—on the contrary, he is confining a government which acts on his suggestions to the “damned if you don’t” outcome.  His stated intent is to create an incremental reprisal system, but what politician can suffer another bureaucracy while the people are crying for blood in the wake of a terrorist attack?  He is leaving policymakers to wander into ethical questions about how many human lives an armed incursion should cost.  And regardless of the bureaucratic apparatus to respond to attacks, inviting small-scale attacks in lieu of larger ones can only yield frustration on the part of the victims—frustration which may spread throughout the voting public.  A policy that would be unpopular for emotional reasons will not likely survive in democratic governments.  Democratically-elected leaders will not likely enact policies that place civilian lives in peril, lest an attack should occur and the politician be voted out for it.

I find steps four and five to be valuable ethical considerations, but irrelevant to strategy.  I contend that affording terrorists dignity will not diminish their desire to rally new attacks.  While Lindy England’s infamous pictures of captive torture in Iraq were cited as predictors of attacks to come[2], the attack on September 11th was perpetrated long before any such scandals unfolded.  If sentiments are a component in terrorist calculations, they are not necessary to yield an attack.

Step six, the idea that successful counterterrorist policy drives a wedge between terrorists and terrorist sympathizers, is neither new nor exclusive to the social sciences.[3] Adler suggests that reports originating from news sources domestic to the terrorist-supporting countries will be critical, but offers no insights on how to generate such reports.  Herein we find a catch-22: If the people distrusted the terrorist organizations enough to report against them in the media, then step six would have been accomplished before those outlets had ever broadcast that information.

Finally, I do not find any substantive difference between Adler’s idea of “defusing by restructuration” and exercising conventional diplomacy and other manifestations of soft and hard power.  His Iran-Israel example undermines the complexities of Middle-East politics, and misrepresents Iran’s nuclear ambitions as a matter of aggression solely toward Israel.

Trapped in the Dilemma Again

I appreciate the clarity with which Adler presents his dilemma.  I also believe that he has some compelling policy suggestions for averting future attacks.  However, his analysis is incomplete.  Without consideration of how to implement his framework, Adler fails to create a viable policy recommendation holistically, though many of his suggestions have merit.  More importantly, the fundamental flaws in his suggestions mean that his prescriptions do not provide an escape from his dilemma.

References:

[1] Adler, Emanuel. 2010. “Damned If You Do, Damned If You Don’t: Performative Power and the Strategy of Conventional and Nuclear Defusing”. Security Studies. 19(2):199-229.

[2] For example, see “Abu Ghraib attack raises fears of resurgent Al Qaeda in Iraq” (http://www.csmonitor.com/World/Middle-East/2009/1117/p06s07-wome.html)

[3] See Galam and Mauger, “On reducing terrorism power: a hint from physics”.  Physica A 323 (2003) 695 – 704

### Probability of World Wars

The Pew Research Center performed a survey polling people about the fate of the world in 2050. The results are fascinating, but there’s one in particular I’d like to pull apart. 58% of people believe there will likely be another world war between now and then.

We can calculate the probability that such a war will occur within this time frame.  Back in the 40’s, a Meteorologist named Lewis Fry Richardson figured out that armed conflict follows a Power Law Distribution.  The bigger the war, the less likely such a war will ever occur.  Thus we can calculate the probability that some random war meets the scale of a world war (more than 10 million casualties).  Harvard’s Lars Erik Cederman replicated this result in a paper in which he took the Correlates of War for an even more accurate fit than Richardson found.  He generated this graph to describe it.

So, let’s assume that the term “World War” means a war in which the scale of casualties is measured in the tens of millions, like the prior two world wars.  That means that for the purposes of the above equation, $$log s \approx 7$$.  For simplicity’s sake, let’s also define the inequality $$S>s$$ as W (for World war). Plugging this into Cederman’s Model:

$$log P(S>s)=1.27-.41 log s$$

$$log P(W)=1.27-.41(7)$$

$$log P(W)=-1.6$$

$$P(W)\approx0.025$$

Thus the probability that any random war will reach the scale of a World War is .025 (or 2.5%).  OK, so what about those random wars?  The Correlates of War project lists 86 interstate conflicts that had 1000 or more fatalities over the course of 185 years (1816-2001).  That’s a rate of approximately $$86/185 =.465$$ wars per year.  So we should expect $$.465*40 \approx 19$$ wars.  Now, what is the probability that one of those will be World War III?

This is a variation on the Birthday Problem.  We can’t calculate the probability that the big war is in our upcoming wars directly.  However, we can calculate the probability that our set of predicted wars does not contain WWIII. The probability of the first war being the big war is approximately 2.5%.  So the probability of the second war being the big war is 2.5%.  And so on.  Now, the probability that the first war or the second war is the big war is not 2.5% + 2.5% (as is the intuition).  If that were the case, we would see a world war within 41 wars with probability greater than 1, which is paradoxical and therefore can’t be true.  However, the probability that neither the first war nor the second war is the big war is $$.975*.975=.951$$.  So, a generic equation for the probability that a World War does not exist given n wars would be

$$P(\neg W|n)=.975^n$$

Now, from here it’s just a small logical step to the probability that a World war occurs within n wars.  It either doesn’t happen (the probability we just calculated), or it does (the probability that’s left).  So a generic equation for the probability that some war in n wars is a World War is

$$P(W|n)=1-(.975^n)$$

We can use this equation to figure out if our estimate of 19 wars within the next 40 years contains world war.  Just set n = 19.  So there’s a $$1-(.975^{19})=.382$$ chance we will see a World war in any given modern 40 year span, the next 40 years included.  58% of the population is probably wrong.

References:

1. Cederman, L. E. (2003). Modelling the size of wars: from billiard balls to sandpiles. American Political Science Review, 97, 135–150.
2. Richardson, L.F. (1960). Statistics of deadly quarrels. Pacific Grove, CA: Boxwood Press.

### If a Nuke landed in Buckhead, would Emory survive?

A friend of mine threw out this fun fact from his International Security course: If a small nuke struck Buckhead, Emory would still have a fighting chance at survival.  However, I don’t think it’s gratifying enough to make such a claim unverified.

First, I’m argue that it is simply unlikely.  The Earth’s circumference is about 40,000km in any given direction.  That means the farthest you need to launch a missile is approximately 20,000km.  There are only a handful of countries with the firepower to launch a missile more than 1000km, much less 10,000.  They are Egypt, France, India, Iran, Iraq, Israel, Libya, North Korea, Pakistan, and The United States.  OK, so how far is 1000km from Buckhead?

Not very far at all.  Maybe the US could launch at Buckhead, but that is approximately as likely as hell spitting Michael Jackson back out.  How about we up the scale an order of magnitude?  From 10,000 km:

The French could get us with their developmental M-51, which is a MIRV (Multiple Independently Targetable Reentry Vehicle), meaning it can carry a payload of between 6 and 10 nuclear warheads, each with an expected yield of 100kt.  So, If the French targeted Buckhead, we would see something like this:

This is with a 400kt yield.  Possible, but pretty unrealistic.  First the French wouldn’t attack us (Democratic Peace, blah, blah, blah).  Second, they wouldn’t concentrate enough weaponry on Atlanta to yield 400kt when they did launch.  So, who would?  We’ve shaved the list of possible aggressors down considerably: only China, Russia, and France could hope to launch an assault, and France isn’t going to.  So, China and Russia it is.  Here are all known worldwide deployments:

OK, so the Chinese best is the CSS-x-10, which Wikipedia claims can hit anywhere in the world.  I’ll take that at face value, and run with it.  It can carry up to 1MT, meaning the blast radius would basically toast Emory.  Here’s the map for a 1.4MT blast.

So Emory would be wiped out, but only by a little bit.  However, once again, back to the probabilities game, we can assume Intelligent selection of warheads, based on population decimation.  DC And New York are the obvious first targets (by population and power distributions) , with LA, Chicago, Dallas, Philadelphia, Houston, and Miami all being jucier, more populous targets.

All of which leads me to wonder: what are the Nuclear “Safe spots”?  Where are you most likely to survive a catastrophic nuclear exchange?

PS–To find out what is fried in the event of a Nuclear blast of a certain size, check this out.

### Boosting and ADTs in MEDUSA

I presented my final presentation for CS485: Bioinformatics yesterday.  It was a fascinating little project aimed at critiquing the Motif-finding algorithm MEDUSA.  First critique: It’s a mess.

MEDUSA is a brilliant idea, coded in an impossible mix of dozens of MATLAB and Perl Scripts spread out across Pre-processing, Processing, and Post-processing modules.  It doesn’t work out of the box on Windows, Solaris, or OSX.  The problem ultimately boils down to poor architecture, hacked together with coffee, insomnia, and attention to the publication of the paper instead of the software package itself.  Not to say this isn’t the case with practically every Bioinformatics software package, but it was an unpleasant change from the usual stuff I worked with in class (The UCSD Genome Browser, WebLogo, and of course my professor’s baby, Galaxy).

Anywho, MEDUSA identifies Motifs (Transcription Binding Factor Sites) by looking for correlations between the presence of the Motif and regulators.  Then they subject the cells to environmental stresses to look for changes in present regulators.  I’m crap at Biology (my partner, Sweta, handled all of that stuff).  What I can tell you is that MEDUSA uses a blend of ADTrees and Boosting to do it.

Boosting is a Machine Learning Technique used to combine many bad guessing rules into one good guessing rule.  Say, for example, I wanted to write a program to predict the outcome of horse races.  I know nothing about horse racing, but I can call a professional gambler.  I ask for his grand secret and he says, “I get a twinge in my knee when I see a winner.”  Thanks a lot, Grandpa.  But then I hand him a sheet full of statistics on horses, to which he then applies a rule of thumb to pick a winner.  By scientific standards, this rule of thumb is pretty bad: very intuitive, only a little better than random guessing.  But this is the beauty of Boosting: that doesn’t matter.  So long as it is at least a little better than random guessing, it is good enough.

So, say I ask multiple gamblers, each of whom will give me a different rule of thumb. Some will be better than others.  I can run my training data on each (represented by x), with a simple binary outcome for each test: the rule predicted correctly, or not (represented by y).  I can select the best rules for specific circumstances.  (If you’re interested, this paper has a great pseudocode breakdown of Adaboost, one of the earliest boosting algorithms with practical potential.)

The problem is that this is inherently binary.  Adaboost simply cannot handle a case where there are more classes than 2.  Other boosting algorithms exist to circumvent this problem, but they’re all pretty much the same idea.  Reduce each class to a binary decision, y or y’.  It’s a lot like a bank that automaticall sorts your coins for you.  The coins roll down a slide, with the smaller coins falling through gaps in the slide that closely match their diameter.  Each gap is a binary decision: Does this coin belong in the wrapper below this, or not? While this is perfectly manageable with a small class count (5 coins), it increases the complexity of the algorithm by a multiple of O(c), the number of different classes.  This becomes important a little later.

Meanwhile, one of the more popular ways to organize this into functional information is a data structure called an Alternating Decision Tree.  Basically, an ADTree alternates between decision states (your rules of thumb) and prediction states (scores you attach to each decision).  By convention, they alternate boxes and ovals, respectively.  A common example of an ADTree in action is spam filtering.  At each decision state (e.g. Frequency of the dollar sign < .0105), it pushes the subject down to a prediction node, which may or may not have additional decision nodes attached to it.  Perhaps an e-mail without many dollar signs will still be spam if it contains many instances of the word “Remove”.  Thus, another node extends below.

The sum of these score in the prediction nodes (on the traversed path) gives the final prediction score.  The value of the prediction score is irrelevant beyond the sign of the score.  In other words:

Hypothesis = sign(S(p)), ere Hypothesis ε{1,-1}

What this means in Bioinformatics is that we can generate a list of weak rules, Boost them into a sequence of Strong rules organized on an ADTree, and use the ADTree to determine if a given sequence in a genome is a motif.  MEDUSA further enhances its predictive accuracy by “Agglomeration,” which is a process of using a number of similar weak learners to generate PSSM Sequence logos for the same motif, and then picking the logo with the lowest offset.  Finally, MEDUSA takes a margin-based scoring metric to determine the efficacy of its findings, which basically means it traverses the graph recursively, only subtracting off a branch (with all subbranches) at each iteration and testing the validity of the tree on the training data.*

So, in addition to the unfortunate bugs and the fact it’s written in a proprietary language, MEDUSA has some pretty unfortunate computational limitations.  In the worst-case, (|k-mers|+|dimers|+|PSSMs|)*|regulators|*2 possible weak rules at every node.  For simplicity, I’ll call that O(n2) with a very bad multiplier.  (For anyone who knows their Comptuational Theory, I apologize, I know how deterministically inaccurate this is.)  The point is, at large genome sizes, this is basically intractible with more than a single digit of regulators.  MEDUSA heuristically improves this, but hacks have a downside.  Remember that thing about Adaboost being inherently binary?  MEDUSA can only predict up-, down-, or non-regulation.  If you wanted to try to predict how much regulatory control a Motif had, all the tools are in place in MEDUSA, but at present it is simply too hard to do.  Thus, you have two diametrically opposed problems: Big complexity, and simplistic output, and any attempt to improve one will exacerbate the other.

Whew–That was a head-full.  I enjoyed Bioinformatics immensely, but I’m also glad to be done with it.

*I think.  Don’t cite me on it.  In fact, please don’t cite this at all.

### What’s the probability a single nucleotide substitution results in a different protein?

I’ve been trying to figure out what to do with my thoughts about Substitution Mutations that cause no change in the output proteins.  Wikipedia is always there to goad me on.  I see a believable description of the phenomenon I was describing, but no numbers describing the likelihood of the events.  So, I did a few back-of-the-envelope calculations and came up with this:

For the third base in a codon, there are three different ways a substitution mutation can occur (if a base is replaced with itself, there’s effectively no way to know, so it’s more useful to pretend they don’t occur at all, and then if we need a frequency metric, increase the final value by about 33%).  So, I counted all of the ways a third base can change the coded amino acid, and by fascinating coincidence, it happens to be 64.  It goes a little something like this:

There are eight amino acids with fourfold degenerate sites, meaning there are 8 amino acids * 4 nucleic base possibilities on the third base = 32 codons (out of 64 possible codons) for which no change on the third base will yield a different amino acid.

Add all of the twofold sites (Sort of 13, with a 2/3 chance of amino acid change) and the single threefold site (a headache to organize, with a 1/3 chance of an amino acid change).  Total all of these possible outcomes, and you’ve got 64 ways an Amino acid can change, and 128 ways it can remain the same after the mutation.  Thus, a total 2 out of 3 times a third base changes, the amino acid remains exactly the same.

Now, the calculations for the other bases are similar, if not a little simpler.  For the first base there are only four codons which code to four other codons representing the same Amino acid.  Thus, (8 codons * 1/3 odds of no change) / 64 possible codons = 1/24 chance of no change in output protein.  Second base is even easier.  There’s only one: UAA codes to UGA.  2 codons * 1/3 odds of no change / 64 possible codons = 1/96 odds of no change.

Combine these three: and you’ve got (64+8+2=) 74 mutations that cause no change, and (3*3*64=) 576 different possible mutations.  Thus 74/576 = 12.85% of substitutions which change a base will not cause a change in the output protein.

Now, this number is not especially meaningful.  These calculations are effectively just a toy.  There are structural constraints that make these evaluations less valid than I would like to present as a viable project.  For instance, these are simply counting outcomes: this method assumes that any base is equally likely as a candidate for replacing a missing base in a substitution mutation, which is almost certainly not the case.

### Coding Proteins and invisible mutations

In addition to being an International Studies major, I’m a CS nut.  And since I basically had the IS Major wrapped (save one elective), I decided to pick back up the Computer Science Major (noting that I could handily finish all of my requirements for both before the end of the year).  For these reasons, I enrolled in CS 485: Bioinformatics.  It’s a quirky CS elective combined with a Biology course, taught by a Computer Scientist who works more with the Med School than with the CS department.  After the gratuitous “this is DNA, four nucleic acids (bases), blah blah blah” lecture, we dove headfirst into the algorithmic backbone of modern bioinformatics.  While I pity the Bio majors, I’ve enjoyed applying the logic of Computer Science to something (slightly) more abstract and fuzzy than the cold, unambiguous ones and zeros of the computing world.

But there’s something that has been consistently bothering me about the methods of base-pair comparison between species.  The business end of DNA (evolutionarily-speaking) is the protein-coding sections (though apparently they only constitute 1.5% of the actual number of base pairs in a Genome).  Proteins are constructed by matching one of about 20 amino acids to three base pairs of RNA.  However, since there are four kinds of bases in a strand, there are 4^3 (=64) possible combinations of three bases.  That’s 64 possible values mapping to 20 possible outcomes.  Luckily, Wikipedia rides to the rescue:

So, based on this table, in half the cases the first two bases decide the amino acid, and the third is unimporant with respect to the protein it codes for.

All of this is fine, until you begin to consider the potential for mutations.  If one of the first two changes, the corresponding Amino acid usually changes.  However, if the third base changes, there will be no functional change that can be identified by the output protein, because the same amino acid will be inserted into the protein, and thus the protein will be identical to one constructed using the unmutated RNA as a blueprint.  Functionally, nothing is different, and the species is not evolving.  If there’s no difference, there’s nothing to select out.  Therefore, you can imagine cases where mutating and non-mutating cells from the same organism have only slightly better than 1 in 4 (25%) likelihood of a match on every third protein-coding base.  This is relevant because it suggests divergence where, in fact, there is none.

This is a small potential divergence, (about .2% of the total genome), but significant enough, I think, in establishing identity between members of the same species.  And since the methods we’ve analyzed in class scour the DNA sequence for “words” of specific numbers of bases, an artificial divergence like this could potentially derail a primitive approach to the comparison of sequences.  So, the question to pursue here is, is there a method in the current realm of Bioinformatics to account for this divergence?