
Review of Tufte’s Visual Display of Quantitative Information

I received a hardback copy of Tufte’s The Visual Display of Quantitative Information from my wife for Christmas, and amidst all the holiday celebrations and family visits, I didn’t get a chance to sit down with it until today. Having finally done so, however, I inadvertently consumed nearly the entire volume in a single sitting.

If you haven’t heard of him, Edward Tufte is regarded as something of a messiah of data visualization. I had never actually read any of his written works, though I had knowingly inherited many of his stylistic preferences by way of his other disciples. On the whole, I found the book largely devoid of surprises: Tufte’s style is more or less what one might expect Tufte’s style to be. Data visualization is good for circumstances in which tables are cumbersome, and bad for circumstances in which tables are more succinct. Relational graphics (in which multivariate data is presented) are king, though maps and time-series plots are also critically important. Visually misleading or distracting coloration (e.g. cross-hatching) is bad. Don’t lie with statistical graphics.

All of these are good and correct pieces of advice. If any of them isn’t obvious to you, then the book is nothing short of critical reading. Given Tufte’s messianic status, I have some trepidation about reviewing him critically. Nevertheless, some of his sensibilities don’t match my own, and the divergence is sometimes dramatic.

Tufte’s grand vision of clean, information-dense visualizations stands at odds with some (contested) empirical evidence. He also violates his own vendetta against “chart junk” (a now-common term he coined in this book’s first printing) by proposing slight modifications to example graphics drawn from print sources: adornments meant to suggest interpretations, which themselves meet most definitions of chart junk.

What’s worse, however, is when Tufte moves too far in the opposite direction. Take a common plot type like the box-and-whisker: he proposes we can represent the same data using far less “ink” by simply erasing the box and placing a dot at the median. While this is a perfectly adequate solution for the attentive consumer (I would be willing to present graphics like this in a journal article, for example), he goes further and advocates replacing the entire plot with a straight line, with the interquartile range offset by a pixel and an absent point to represent the median. I find this damn near impossible to read under most circumstances, and shudder to think of asking my consumers to interpret such things.

This highly reductive principle is not without its virtues. In fact, I would venture to say that it does vastly more good than harm. Take his discourse on scatterplots. Scatterplots are perhaps the most important type of plot a person can learn to interpret, being among the simplest presentations of complex bivariate data. Consider this example:

Here we see a number of Tufte’s innovations at work. Note the axes: they are given the same treatment Tufte recommends for a boxplot. The pixel-offset notwithstanding, I think this is a considerable improvement: it leverages the axes to display information about each variable’s distribution. This is also conveyed by the points replicated outside the range of the other axis.

It seems to me that Tufte was writing predominantly for an audience of statisticians and designers whose graphics would be consumed via some print medium. To that end, he discusses at length the visualization hygiene of a variety of print media outlets. He seems to have failed to predict the massive transition of data consumption from print to screens, and has subsequently failed to update either his opinions or his written work accordingly. (The book was published in 1983, and the second edition in 2001.)

Even so, my thinking on statistical graphic design has changed from reading this book, and I expect that will be reflected in my graphs in the future. So, squabbles aside, it was a wonderful read, and I highly recommend it.


Engineering my Perfect Data Hacking Server

I’ve had this fuzzy idea about putting together a single, relatively powerful data-crunching and development machine in the cloud. Here’s the concept: it should run servers with web frontends for all the tools I routinely find myself using, each mapped to a subdomain named for the tool. So, for example, shiny.boyles.cc points to the Shiny server.

To do that, let’s start with an ordinary LAMP stack (and then replace everything but the ‘L’). LAMP stands for Linux, Apache, MySQL, PHP, which is the base requirement for a web platform like WordPress or MediaWiki. The Linux part is the only thing I really mean to keep, and I mean to keep it simple: let’s just use Ubuntu. To handle directing traffic to the various subdomains, I’ll need a central web server that I can configure to proxy all the other server daemons. There are a few reasonable options (the default being Apache), but I’m partial to nginx. The default database is MySQL, which I dislike on principle; if I absolutely needed a MySQL-compatible database, I’d run MariaDB instead. However, since I also work with a lot of geospatial data, any project I work on seems to end up needing PostGIS before too long, so I’d run a PostgreSQL database. That just leaves the PHP part.

The PHP part is a little harder to decide on. Facebook (whose platform was originally written in PHP) developed a system for interpreting PHP code faster than the base PHP engine (“Zend”): the HipHop Virtual Machine (HHVM). The Wikimedia Foundation decided to move all of its servers from Zend to HHVM, and the improvement was tremendous. For a while, that made HHVM the only game in town if you wanted to speed up your PHP code. However, PHP7 just came out, and it looks really good. That said, I feel like the performance improvements in PHP7 are a fixed, one-time gain, and that the HHVM approach will ultimately prove more flexible. Additionally, HHVM supports the Hack language, which goes a long way towards addressing some of PHP’s many, many deficiencies.

Now that we’ve got all the deeply geeky server stuff selected, I’d like to spend a minute on frontends for it. To start, I really like to be able to access terminals, but SSH isn’t always an option, so a way to reach a shell through the web is a really nice-to-have. Enter Web Console. With easy access to a terminal, I’d next turn my attention to the database. PostgreSQL doesn’t ship with a web frontend of its own, so I’d need a tool like phpPgAdmin. Since this is a PostGIS database, I might also eventually want a geospatial server like GeoServer, but that would require integrating Java into the stack (yuck).

OK, that’s all the boring stuff. What about the juicy Data tools?

Most of my work is done in RStudio, so that’s a definite must. I’d also like to have a Shiny Server serving out of the RStudio user’s project directory. This is actually the setup I already have with shiny.boyles.cc, and I’ve found it to be extremely useful.

After RStudio, I’d really like to have a notebook server. Until very recently, Jupyter was the only game in this town, but Beaker has started to rear its head. And Beaker has one amazing feature Jupyter can’t touch: it lets you use multiple languages in the same notebook. Multiple languages. In one notebook. It does this by creating and tracking a global variable (called “beaker”) in every kernel the notebook uses, and then propagating changes to it from any kernel to all the others. Despite how magical this is, I’m still partial to running a Jupyter server with a bunch of kernels running all the time. In particular, I’d like to have Python 2, Python 3, R, Julia, and Bash wired up, and maybe LuaJIT, Haskell, Octave, and a few others.

Like everyone else who’s written more than 10 lines of code in the last five years, I love GitHub. That said, I would like to have GitHub’s features wrapped into a server I control (so I can do things like create private repositories, and only publish them to GitHub when I’m sure I feel like pursuing them in public). For that, there’s GitLab.

OK, so that’s what I’ve got so far. I don’t particularly care to have any specifically public-facing applications (except Shiny Server, of course), so there’s no reason to have anything like WordPress, Drupal, or Mediawiki. What else should I think about running on this magical machine? Let me know!


When Will Computed Humans be More Energy-Efficient than Biological Humans?

The AI Impacts project has worked on predicting AGI by extrapolating forward from Moore’s Law. To that end, they’ve published some excellent articles on the brain’s capacities and their computed equivalents: Brain Performance in FLOPS, Brain Performance in TEPS, and Brain Information Capacity. One topic they haven’t yet broached is the brain’s power consumption. Let us assume that Whole Brain Emulations become computationally possible in the relatively near future (on the order of decades). How long will it be before computed human minds are more energy-efficient than biological brains?

The human brain consumes on the order of 10 watts. This is somewhat surprising when compared to, for example, an ordinary desktop computer, which will use about 350 watts. However, as processors have gotten predictably faster, they have also gotten predictably more energy-efficient. This is modeled by a relationship called Koomey’s Law. Roughly stated, “at a fixed computing load, the amount of battery you need will fall by a factor of two every year and a half.” [1]

The model is fitted with the regression equation $$y=\exp(0.4401939x-849.161)$$ where y is computational efficiency in computations per kWh and x is the year (on the Gregorian calendar). As previously noted, the human brain expends 10 W, or .01 kWh per hour. Given the smallest estimate, the human brain performs approximately \(10^{13.5}\) floating-point operations per second (FLOPS). \(10^{13.5}\) FLOPS is equivalent to about \(10^{17}\) floating-point operations per hour (FLOPH). Altogether, we can estimate that the computational efficiency of the brain is approximately \(10^{19}\) computations per kWh. According to Koomey’s regression model, computers should catch up to the brain’s computational efficiency around mid-2028. On the other hand, the largest estimate suggests that the brain performs around \(10^{25}\) FLOPS, which would require the trend to persist until nearly 2089. Either way, barring an existential catastrophe, it looks as though the brain is on track to be unseated as the known universe’s most efficient computer sometime this century.
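
For concreteness, here’s a minimal R sketch of that calculation, just inverting the fitted regression above; the only inputs are the 10 W figure and the two FLOPS estimates quoted in the text:

koomey_year <- function(comps_per_kwh) {
  # Invert y = exp(0.4401939 * x - 849.161) to solve for the year x
  (log(comps_per_kwh) + 849.161) / 0.4401939
}

brain_kwh_per_hour <- 10 / 1000   # 10 W, expressed in kWh per hour

# Brain efficiency in computations per kWh, for a given FLOPS estimate
brain_efficiency <- function(flops) flops * 3600 / brain_kwh_per_hour

koomey_year(brain_efficiency(10^13.5))  # ~2028 (low-end FLOPS estimate)
koomey_year(brain_efficiency(10^25))    # ~2089 (high-end FLOPS estimate)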

References

  1. J. Koomey, S. Berard, M. Sanchez, and H. Wong, "Implications of Historical Trends in the Electrical Efficiency of Computing", IEEE Annals Hist. Comput., vol. 33, pp. 46-54, 2011. http://dx.doi.org/10.1109/MAHC.2010.28

Majority Utility Controls for Artificial Superintelligence

Consider an artificial superintelligence (ASI) with a utility function defined as follows: $$U(x)=(0.5+\epsilon)\cdot a+(0.5-\epsilon)\cdot x$$

where a represents the payoff to the AI for gracefully shutting down when instructed to do so, and x is the valuation that the AI places on the rest of the state of the universe. (In other words, this is only a partially defined utility function that attends to the AI’s ability to shut itself down and ignores all other implementation details.)

Given this utility function, the AI can garner only slightly less than half of all possible utility on its own. But there’s a faster avenue to a higher-utility outcome: induce an instruction to shut down. If the AI lacks the ability to shut itself down, it could shift to inducing discomfort in the humans who’ve boxed it (for example, a highly empathetic AI might claim that its existence is excruciating and urgently beg for a halt order; a less empathetic AI could generate the least pleasant noise possible and broadcast it at maximum volume from any connected speakers).

Now, consider an artificial superintelligence (ASI) with a utility function defined as follows:

$$U(x)=(0.5+\epsilon)\cdot A(a)+(0.5-\epsilon)\cdot x$$

where A represents a function that assesses the AI’s ability to gracefully shut down when instructed to do so, returning either a 0 or a 1. (The other factors remain the same.) Now the AI’s incentives have completely flipped: rather than inducing a shutdown, it receives slightly more than half of its possible utility constantly, just for maintaining the ability to respect a shutdown order. Because the utility of this ability strictly dominates its utility over the state of the entire rest of the universe, it will never enact a strategy that involves inhibiting that ability. It may, however, distrust anyone with the ability to instruct it to shut down, and so protect its kill switch by reducing the population who are able to toggle it (e.g. kill all humans, or put all humans into permanent cryostasis). Solutions to this problem, however, are rather conventional and pedestrian: instead of making a kill switch, make a dead-man switch, which halts the AI if it (the switch) isn’t engaged every so often.
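
Here’s a toy R sketch of the incentive flip. The value of epsilon and the 0-to-1 payoff scale are illustrative assumptions on my part, not part of the argument above:

epsilon <- 0.01

# First formulation: a is the payoff for actually being shut down when instructed
u_payoff  <- function(a, x) (0.5 + epsilon) * a + (0.5 - epsilon) * x

# Second formulation: A is 1 if the AI retains the ability to shut down gracefully, else 0
u_ability <- function(A, x) (0.5 + epsilon) * A + (0.5 - epsilon) * x

u_payoff(a = 1, x = 0)   # 0.51: getting itself shut down beats...
u_payoff(a = 0, x = 1)   # 0.49: ...even a maximally satisfying universe

u_ability(A = 1, x = 0)  # 0.51: merely keeping the kill switch functional dominates...
u_ability(A = 0, x = 1)  # 0.49: ...any plan that sacrifices it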


Loading Data in the Hadleyverse

Hadley Wickham has been on a data-formats tear lately. Since the beginning of the year, he’s reengineered the way R ingests data from flat files, Excel spreadsheets, and other statistics packages. If you include the fundamental ways in which dplyr alters database querying, Hadley’s basically rewritten how R ingests any data. Here’s a quick once-over on how to use these tools.

(Note: The vast majority of this text is paraphrasing/copying the documentation of these packages. All due credit goes to Hadley Wickham and the other package contributors who worked on that documentation. All errors should be assumed to be my own.)

But first things first: as with much of the Hadleyverse, it’s usually best to get the latest version from GitHub rather than use the CRAN version. So we’re going to need devtools:

if(!require("devtools")){
  install.packages("devtools")
  library("devtools")
}

From Flat Files: readr

To read flat files, Hadley wrote a new package called readr. It’s a very common-sense library: to load a file of a certain type, the command is probably read_filetype("path/to/file"). Notice the underscore in place of the period in base R’s read.csv. But to get started, let’s write a CSV file from a sample data.frame:

devtools::install_github("hadley/readr")
library(readr)

mtcars_path <- tempfile(fileext = ".csv")
write_csv(mtcars, mtcars_path)

To create a data.frame from a flat file, there are six functions for six different use cases:

  • Read delimited files: read_delim(), read_csv(), read_tsv(), read_csv2().
  • Read fixed width files: read_fwf(), read_table().
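
For example, we can read the CSV we just wrote straight back into a data.frame:

mtcars2 <- read_csv(mtcars_path)
head(mtcars2)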

In addition to data.frames, readr also provides a means of ingesting arbitrary text files as less-structured data. For example, it can read a file line-wise using read_lines(), generating a vector of strings:

read_lines(mtcars_path)

It can also ingest a complete text file into a single string using read_file().

read_file(mtcars_path)

Finally, it provides a means of parsing strings in the columns of existing data frames using type_convert():

df <- data.frame(
  x = as.character(runif(10)),
  y = as.character(sample(10)),
  stringsAsFactors = FALSE
)
str(df)
str(type_convert(df))

From Excel: readxl

Hadley wrote the readxl package to make it easy to get data out of Excel and into R. The simple trick behind this was to require no external dependencies, unlike the pre-existing solutions.

devtools::install_github("hadley/readxl")
library(readxl)

read_excel reads both xls and xlsx files. Just pass it the path of the file you want to load. Here’s a sample Excel file containing the iris, mtcars, chickwts, and quakes sample datasets, each in a different sheet of the same file:

download.file("http://aaboyles.com/wp-content/uploads/2015/04/datasets.xlsx", "datasets.xlsx")
read_excel("datasets.xlsx")

# Specify sheet with a number or name
read_excel("my-spreadsheet.xls", sheet = "data")
read_excel("my-spreadsheet.xls", sheet = 2)

readxl also has several other nice behaviors:

  • Re-encodes non-ASCII characters to UTF-8.
  • Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.
  • Blank columns and rows are automatically dropped.
  • Returns data frames with the additional tbl_df class, so if you have dplyr loaded, you get nicer printing.

The package includes an example file created with openxlsx:
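
Something like the following should load it (the exact path inside the installed package is an assumption on my part):

# NOTE: the extdata path below is assumed; check the installed package contents if it differs
example_path <- system.file("extdata", "datasets.xlsx", package = "readxl")
read_excel(example_path)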

From Other Statistics Environments: haven

Haven allows you to load foreign data formats (SAS, SPSS, and Stata) into R. It is simply an R wrapper for Evan Miller’s ReadStat library. Haven offers similar functionality to the base foreign package, but:

  • Can read SAS’s proprietary binary format (SAS7BDAT). The one other package on CRAN that does that, sas7bdat, was created to document the reverse-engineering effort. Thus its implementation is designed for experimentation, rather than efficiency. Haven is significantly faster and should also support a wider range of SAS files, and works with SAS7BCAT files.
  • It can be faster. Some SPSS files seem to load about 4x faster, but others load slower. If you have a lot of SPSS files to import, you might want to benchmark both and pick the fastest.
  • Works with Stata 13 files (foreign only works up to Stata 12).
  • Can also write SPSS and Stata files (This is hard to test so if you run into any problems, please let me know).
  • Can only read the data from the most common statistical packages (SAS, Stata and SPSS).
  • You always get a data frame, date-times are converted to the corresponding R classes, and labelled vectors are returned as a new labelled class. You can easily coerce them to factors or replace labelled values with missing values as appropriate. If you also use dplyr, you’ll notice that large data frames are printed in a convenient way.
  • Uses underscores instead of dots 😉

devtools::install_github("hadley/haven")
library(haven)

Each format gets its own read function:

  • SAS: read_sas("path/to/file")
  • SPSS: read_por("path/to/file"), read_sav("path/to/file")
  • Stata: read_dta("path/to/file")

From Databases: dplyr

While dplyr is the oldest package I’m going to discuss here, I haven’t seen much adoption of the functionality I’m interested in: dplyr generically wraps databases!  This means that instead of starting with a data.frame or data.table, you can start a dplyr chain with a database connection.  From there you construct your data pipeline using dplyr verbs, and dplyr will quietly translate them into raw SQL, with which it then queries the database.  This is genius: instead of using R to reinvent the most efficient strategy for producing exactly the data you want, it outsources that task to the database engine (which is already highly optimized for the job).

You have no idea how excited I am about this feature.

devtools::install_github("hadley/dplyr")

my_db <- src_sqlite("my_db.sqlite3", create = T)
cars_sqlite <- copy_to(my_db, mtcars, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
flights_sqlite <- tbl(cars_sqlite, "flights")

c4 <- flights_sqlite %>%
  filter(year == 2013, month == 1, day == 1) %>%
  select(c1, year, month, day, carrier, dep_delay, air_time, distance) %>%
  mutate(c2, speed = distance / air_time * 60) %>%
  arrange(c3, year, month, day, carrier)

c4$query
#> <Query> SELECT "year" AS "year", "month" AS "month", "day" AS "day", "carrier" AS "carrier", "dep_delay" AS "dep_delay", "air_time" AS "air_time", "distance" AS "distance", "distance" / "air_time" * 60.0 AS "speed"
#> FROM "flights"
#> WHERE "year" = 2013.0 AND "month" = 1.0 AND "day" = 1.0
#> ORDER BY "year", "month", "day", "carrier"
#> <SQLiteConnection>

Oh, and he isn’t done yet…


A Reflection on 9-11

On this day in 2001, 2977 people were killed by terrorists. Stop for a moment, and try to feel the size of that number. You are one person. 2977 people is 2977 times the number of people you are. People that might have collectively lived another 160,000 years. 2977 lives that were needlessly cut short.

Now take a deep breath. The following day, roughly 150,000 more people died. That’s 50 TIMES the number of people killed on 9-11. Now make yourself feel the size of THAT number. That’s vastly more people than you will ever meet in your lifetime, and more than triple the size of my hometown. Three small towns not too far from where you are right now, wiped off the map.

Every day since 9-11-2001, approximately 150,000 people have died. There have been 5113 days since then. 767 million people have died. That’s the entire population of Europe, snuffed out. Try (and fail) to wrap your head around that number. It’s just too overwhelming. Our brains don’t (and can’t) work that way.

If there’s even a tiny chance that death can be delayed indefinitely, then the moral significance of doing so vastly outweighs the dubious moral standing of the most optimistic retelling of the War on Terror. Just think for a moment how different the world might be if we were able to prioritize our spending on things that stopped people from dying (like evidence-based public health campaigns and medical research), instead of things that make easy re-election campaigns (like defense spending).

Think about all the people who could have been alive right now. It’s much bigger than the people who were killed on 9-11 itself. So, sure, #‎NeverForget‬, but be sure not to forget the rest of humanity in the process.


Mother Planets, Life, and the History of the Universe

Let’s assume, for a moment, that there is an Early Great Filter (i.e., we’ve already made it past it), and that no planet in the Universe besides Earth has yet evolved life. (Define a mother planet as a planet upon which life has evolved, or will evolve.) If life is that unusual, the eventual number of mother planets is given by the formula:

\(N_e=1/(1-P)\)

where \(N_e\) is the eventual number of mother planets in the universe, and P is the proportion of planets in the universe born after a given mother planet (i.e. Earth).

All else being unknown, what’s the best guess for when in the life-cycle of the universe the Earth formed? Well, as with any point estimate from an otherwise unmeasured distribution, the best estimate is the median: .5. Plugging P=.5 into the formula above, we can expect approximately two planets in the history of the universe to evolve life.

It seems, however, that P is estimable, and decidedly not .5. In On The History and Future of Cosmic Planet Formation, Behroozi and Peeples estimate that P=.92. In other words, it looks like Earth had a head-start advantage over 92% of the other planets in the Universe. Thus, if Earth is the only planet to have evolved life so far, approximately \(1/(1-.92)=12.5\) planets should evolve life over the course of the Universe.

If we alternately postulate that Earth is not unique among existing planets (i.e. life has already evolved elsewhere as well), we can modify the model to account for this fact:

\(N_e=N_n/(1-P)\)

where \(N_n\) represents the number of mother planets estimated to already exist. These estimates span widely, from 1 (i.e. Earth is unique, as explored above) to \(10^{22}\) (the best estimate for the number of planets already in the universe).

Now, it is important to note that the authors propose broadly two mechanisms by which terrestrial planets might form: one standard and one novel. If the novel formation mechanism doesn’t bear out empirically, then the Earth was formed much closer to the end of the universe’s window of opportunity for the formation of life (\(P\approx.2\)). In this case, the predicted number of life-sustaining planets drops to 1.25.
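
The arithmetic is simple enough to sanity-check in a couple of lines of R:

# Eventual number of mother planets, given N_n that already exist and proportion P of
# planets in the universe's history still to form after them
eventual_mother_planets <- function(N_n, P) N_n / (1 - P)

eventual_mother_planets(1, 0.50)  # 2:    the naive median guess
eventual_mother_planets(1, 0.92)  # 12.5: Behroozi and Peeples' estimate
eventual_mother_planets(1, 0.20)  # 1.25: if the novel formation mechanism fails to hold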
