d3.js: using CSV for numbers

D3 has two main ways to load data: JSON files or CSV files. JSON is generally better but bloated for long runs of repeated rows, where CSV wins because it only prints the key names once. But CSV has a drawback: it’s untyped, so D3 just lamely loads everything as strings. Most of the time Javascript’s automatic string-to-number conversion lets you ignore that, but it fails in some places, like sorting with d3.ascending, which compares two strings character by character.
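To make the sorting failure concrete, here is a small sketch. The `ascending` function is a stand-in written to behave like d3.ascending so the snippet runs without loading D3, and the sample values are made up.

```javascript
// Stand-in for d3.ascending so this runs without D3 (an assumption, not D3's source).
function ascending(a, b) {
  return a < b ? -1 : a > b ? 1 : a >= b ? 0 : NaN;
}

// Made-up values, as strings, the way a CSV loader would deliver them.
var asStrings = ["9", "10", "100", "25"];
var asNumbers = asStrings.map(Number);

// Strings compare character by character, so "10" sorts before "9".
console.log(asStrings.slice().sort(ascending)); // [ '10', '100', '25', '9' ]

// Numbers compare by value.
console.log(asNumbers.slice().sort(ascending)); // [ 9, 10, 25, 100 ]
```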

Fortunately D3’s CSV library lets you supply an accessor function which manipulates each row as it is loaded. Here’s a function which will go through every row and try to convert things to numbers. I wrote it in very verbose Javascript to make it clear what’s going on. I don’t care about the number of lines, but I do wonder if this is less efficient than it could be.

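// convert strings to numbers in CSV
function convertNumbers(row) {
  var r = {};
  for (var k in row) {
    // unary plus converts the string to a number, or NaN if it isn't numeric
    r[k] = +row[k];
    if (isNaN(r[k])) {
      // not a number: keep the original string
      r[k] = row[k];
    }
  }
  return r;
}

d3.csv("data.csv", convertNumbers, function(error, inputData) {
  // do work on the data here
});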

Note that the accessor function has to create a new object; you can’t just return the row object it was passed.
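One caveat with converting via unary plus: +"" evaluates to 0, so an empty CSV field silently turns into the number 0. A possible stricter variant, as a sketch only; the name convertNumbersStrict and the sample row are invented for illustration.

```javascript
// Hypothetical stricter accessor: blank fields and non-numeric text stay strings.
function convertNumbersStrict(row) {
  var r = {};
  for (var k in row) {
    var s = row[k];
    var n = +s;
    // Keep the original string when it is blank or does not parse as a number.
    r[k] = (s.trim() === "" || isNaN(n)) ? s : n;
  }
  return r;
}

// Made-up row, shaped like what a CSV loader hands to the accessor.
var sample = { name: "alice", kills: "120", kdr: "1.75", clan: "" };
var converted = convertNumbersStrict(sample);

// name stays a string, kills and kdr become numbers, clan stays an empty string.
console.log(converted);
```

It is used the same way as the accessor above: pass it as the second argument to d3.csv.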

Homebrew observations

You can’t move a Homebrew binary directory. I installed Homebrew in /usr/homebrew2 and then renamed that to /usr/homebrew, and a bunch of stuff broke because it had the path /usr/homebrew2 baked into it: bash-completion, and also python.

Many Homebrew packages don’t compile cleanly. I ran into three: mg, unison, and ggobi. I filed bugs and got prompt responses (and a patch for ggobi in 12 minutes!), which is pretty great. OTOH it feels to me like Homebrew doesn’t have any sort of automated build-and-test setup for packages; these were all really simple bugs that anyone installing the code would hit. Some basic integration testing would go a long way.

I’m very grateful to Homebrew but the whole project feels pretty ad hoc and brittle to me. No surprise since it started out being such a simple thing. And simple is good! But I think at this point it’s evolved into needing to be a serious package manager, but hasn’t yet evolved all the serious package management. I wish we’d just use dpkg on MacOS instead. Fink is exactly that, but my experiences with it have been bad.

Starting anew with a clean Homebrew install. Note I’ve already set PATH and BREW to use /usr/homebrew, the non-standard location.

sudo mkdir /usr/homebrew
sudo chown nelson /usr/homebrew
cd /usr/homebrew

git clone https://github.com/Homebrew/homebrew.git
mv homebrew/* homebrew/.git* .
rmdir homebrew

brew tap homebrew/dupes
brew install atk autoconf automake bash-completion cairo clens cloog faac ffmpeg fontconfig freetype freexl gdal gdbm gdk-pixbuf geos gettext gfortran ggobi giflib git glib gmp gobject-introspection gpp graphicsmagick grc gtk+ harfbuzz icu4c iftop isl jpeg jq json-c lame less libffi libgeotiff liblwgeom libmpc libpng libspatialite libtiff libtool libxml2 lzlib makedepend mp4v2 mpfr nmap node objective-caml openssl optipng ossp-uuid p7zip pango pgdbf pixman pkg-config postgis postgresql proj pv python readline rsync sqlite texi2html unrar wget x264 xvid xz yasm

brew install mg
brew install unison

Battlefield 4 graphs

Some quick data views, looking at player statistics from the game Battlefield 4. I’m scraping stats from a site that lets me get a bunch of numbers for players, statistics like “Kill/Death Ratio” and “Score per Minute” and the like. I collected a bucket of stats for the top 1000 players (by rounds completed) for all 5 platforms, then graphed various variables. Honestly I haven’t found a very interesting story in this data yet; the main thing I’ve learned is that time played is not really correlated with any measure of skill. I.e., no one learns to play better. But there’s more to consider.

Each scatter plot shows two variables (labeled on the axes); the histogram below it shows the x-axis variable.

Kill/Death Ratio vs. Time Played
Win/Loss Ratio vs. Kill/Death Ratio
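One way to put a number on “not really correlated” is Pearson’s correlation coefficient, which is near 0 when two variables have no linear relationship and near 1 or -1 when they move together. This is a sketch, not necessarily how the graphs above were analyzed; the pearson function and the sample numbers are invented for illustration.

```javascript
// Pearson correlation coefficient of two equal-length numeric arrays.
function pearson(xs, ys) {
  var n = xs.length;
  var meanX = 0, meanY = 0;
  for (var i = 0; i < n; i++) {
    meanX += xs[i];
    meanY += ys[i];
  }
  meanX /= n;
  meanY /= n;

  var sxy = 0, sxx = 0, syy = 0;
  for (var j = 0; j < n; j++) {
    var dx = xs[j] - meanX;
    var dy = ys[j] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Made-up numbers, NOT real Battlefield 4 data.
var hoursPlayed = [10, 50, 120, 300, 800];
var killDeathRatio = [1.1, 0.9, 1.3, 0.8, 1.0];
console.log(pearson(hoursPlayed, killDeathRatio));
```

With the CSV loaded as numbers, the two arrays would come from something like inputData.map(function (d) { return d.someColumn; }), where the column name depends on the file.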

Interactive data exploration

I’m exploring a multivariate data set: Battlefield 4 data, to be exact. I have a dataset for thousands of players with such statistics as “time played”, “kill/death ratio”, “win/loss ratio”, etc. Fun stuff. Now I’m exploring it looking for interesting patterns, clusters, etc., and doing it by hand-writing Javascript code with D3.

I wish there were a solid consensus tool for exploring datasets and showing graphics. Excel and R seem to be the two most popular. But Excel is too primitive and R is too much like programming already.


Wanted: browser based page scraping

Doing yet another HTML scraping project, contemplating the slowness and desolation that is BeautifulSoup, or spending hours learning scrapy, or surely there’s something better by now?

There is: my web browser, with DOM CSS queries. Just load the page and do querySelector and you’re done. Most modern HTML is quite nicely scrapable in browser Javascript. The problem is you can’t effectively script a browser to process thousands of pages. I’d hoped node.js would offer a solution, but it doesn’t have a battle-hardened HTML parser like a browser has. There are some options; I wonder if any are worth the time to learn about.
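For a sense of how little code the in-browser approach needs, here is a minimal sketch. The scrapeText helper and the "td.player-name" selector are made up for illustration; querySelectorAll and textContent are the standard DOM calls.

```javascript
// Return the trimmed text of every element matching a CSS selector.
// `doc` is anything with querySelectorAll, normally the page's `document`.
function scrapeText(doc, selector) {
  var nodes = doc.querySelectorAll(selector);
  // A NodeList is array-like, so borrow Array's map.
  return Array.prototype.map.call(nodes, function (el) {
    return el.textContent.trim();
  });
}

// In a browser console on the page being scraped (hypothetical selector):
// scrapeText(document, "td.player-name");
```

Taking the document as a parameter keeps the function usable with any DOM-like object, which matters if a node.js parser ever turns out to be good enough.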


Chrome extension gripes

Working along on my Chrome extension, generally impressed with how well thought out and thoroughly documented the extension support is. That being said, some wrinkles…

There’s no support for reloading extension code when it’s changed during development. There are some hacks; I’m using Extensions Reloader, which gives me a button to press to reload the Javascript code. But it won’t reload the manifest, so it’s not a complete solution. And even then I have to hit the extension reload button and then refresh the page to debug stuff. It’s awkward.

Chrome has adopted an awkward API for signaling errors in the extension API: the variable chrome.runtime.lastError is set. This makes checking for errors a huge nuisance, but maybe it’s the only way to do this given the weird lifecycle of extensions? Good thing Javascript is single threaded :-P. It’s a shame Javascript’s built-in exceptions are not very useful. I like the D3 pattern of passing an error object to the callback function; at least it makes the error variable a bit more explicit.
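One way to make the lastError check harder to forget is a tiny wrapper that turns it into an explicit first argument, in the spirit of the D3 callback style. A sketch only: withLastError is a made-up helper, not part of the Chrome API, and the "config" key is hypothetical.

```javascript
// Wrap a callback so it receives chrome.runtime.lastError (or null) as its
// first argument, followed by whatever the extension API passed.
function withLastError(callback) {
  return function () {
    var hasChrome = typeof chrome !== "undefined" && chrome.runtime;
    // lastError is only set while the API callback is running, so read it here.
    var error = (hasChrome && chrome.runtime.lastError) || null;
    var args = Array.prototype.slice.call(arguments);
    callback.apply(null, [error].concat(args));
  };
}

// Intended use inside an extension (not run here):
// chrome.storage.sync.get("config", withLastError(function (error, items) {
//   if (error) { console.warn(error.message); return; }
//   console.log(items);
// }));
```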

The Chrome extension storage API is fairly capable; I particularly like that it lets you store data in a place that Chrome synchronizes between the user’s browsers. But the API is just different enough from DOM localStorage to be a bit obnoxious. Also the get() method is incredibly slow, like 500ms to retrieve my 10 bytes of state. That means I can’t just store stuff there and fetch it from my content script; it’s too slow for something whose purpose is to modify a page’s presentation. So I have to create a background page and move the config fetching there, an unwelcome complication. Update: it’s not always so slow. Sometimes it’s only 40ms. Sometimes it’s 200ms. I’m running a bunch of other extensions, so I should test it more cleanly.
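To test the latency more cleanly, something like the following could time individual calls. A sketch: timeCallback is an invented helper and "config" is a hypothetical key; only chrome.storage.sync.get itself is a real API.

```javascript
// Time a callback-style asynchronous call.
// startAsync receives a callback to hand to the API being measured;
// done receives the elapsed milliseconds and the API's callback arguments.
function timeCallback(startAsync, done) {
  var start = Date.now();
  startAsync(function () {
    var elapsed = Date.now() - start;
    done(elapsed, Array.prototype.slice.call(arguments));
  });
}

// Intended use inside an extension (not run here):
// timeCallback(
//   function (cb) { chrome.storage.sync.get("config", cb); },
//   function (ms, args) { console.log("get() took " + ms + "ms", args[0]); }
// );
```

Running it a few dozen times with the other extensions disabled would show whether the 40ms to 500ms spread comes from the storage API or from the rest of the browser.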


Dog food

I found the perfect way to encourage myself to work on this Chrome extension. Install a version whose UI is so eye-bleedingly awful I can’t help but fix it.

[Screenshot: Screen Shot 2014-04-08 at 9.07.47 AM]