TopoJSON notes, watershed boundaries and HUCs

I played around with Mike Bostock’s TopoJSON today, a file format plus JavaScript library for efficiently encoding maps. It represents the same kind of data as GeoJSON but gets amazing compression, partly by identifying line segments shared between polygons and encoding them only once (i.e. each border is drawn just once) and partly by quantizing and, optionally, simplifying the data. It’s quite clever!
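To make the shared-border idea concrete, here is a toy sketch in Python. It is not TopoJSON's real encoder: the two polygons, their names and their coordinates are invented for illustration. The one detail borrowed from the actual format is that a negative arc index ~i means "arc i, reversed".

```python
def decode_ring(arc_indexes, arcs):
    """Stitch arcs into a closed ring; a negative index ~i means arc i reversed."""
    ring = []
    for index in arc_indexes:
        arc = arcs[~index][::-1] if index < 0 else arcs[index]
        # Skip the first point of every arc after the first one, because it
        # repeats the last point of the previous arc.
        ring.extend(arc if not ring else arc[1:])
    return ring


# Two made-up adjacent squares. Arc 0 is the border they share.
arcs = [
    [(1, 0), (1, 0.25), (1, 0.5), (1, 0.75), (1, 1)],  # shared border
    [(1, 1), (0, 1), (0, 0), (1, 0)],                  # rest of the west square
    [(1, 0), (2, 0), (2, 1), (1, 1)],                  # rest of the east square
]
polygons = {"west": [0, 1], "east": [2, ~0]}

rings = {name: decode_ring(indexes, arcs) for name, indexes in polygons.items()}

# GeoJSON-style storage repeats the shared border in both rings;
# arc-based storage keeps it once.
geojson_points = sum(len(ring) for ring in rings.values())
topology_points = sum(len(arc) for arc in arcs)
print(geojson_points, topology_points)
```

The saving grows with the length of the shared borders, which is why a dataset of tightly packed watershed polygons is a good fit.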

My test data is the Watershed Boundaries from NHDPlusV2: the California-only NHDPlusV21_CA_18_WBDSnapshot_01.7z (89MB) and the national NHDPlusV21_NationalData_WBDSnapshot_Shapefile_01.7z (1.7GB).

Converting

Converting the smaller file to TopoJSON was a snap:

topojson WBD_Subwatershed.shp -p HU_10_NAME -p HU_12_NAME --id-property "HUC_12" >| ~/src/huc-map/WBD_California.geojson

The -p flags copy the HU_10_NAME and HU_12_NAME attributes from the shapefile into each geometry’s properties, and --id-property makes HUC_12 the id of each geometry.

Simplifying

Taking the default arguments, TopoJSON reduced my 89MB SHP file to a 9.3MB TopoJSON file. Not bad, particularly since the straight GeoJSON is 257MB. A lot of that compression comes from shared boundaries, but quantization helps too: by default the topojson tool quantizes coordinates to 10,000 values along each axis. I’m not sure, but I think that means “it’ll look right if your map isn’t more than 5000 pixels wide”.
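Here is a rough Python sketch of what quantizing to 10,000 values could look like: snap every coordinate to an integer grid spanning the bounding box, then store each point as an offset from the previous one. This is my reading of the idea, not the tool's actual code. The function names and sample points are made up, and the real encoder may differ in its details.

```python
def quantize(points, q=10_000):
    """Snap (x, y) points to a q-by-q integer grid spanning their bounding box."""
    xs, ys = zip(*points)
    x0, y0 = min(xs), min(ys)
    # Size of one grid step in source units; fall back to 1 for a zero-width extent.
    kx = (max(xs) - x0) / (q - 1) or 1
    ky = (max(ys) - y0) / (q - 1) or 1
    grid = [(round((x - x0) / kx), round((y - y0) / ky)) for x, y in points]
    return grid, (kx, ky), (x0, y0)


def delta_encode(grid):
    """Keep the first point absolute and store each later point as an offset."""
    deltas = [grid[0]]
    for (px, py), (cx, cy) in zip(grid, grid[1:]):
        deltas.append((cx - px, cy - py))
    return deltas


# Hypothetical lon/lat points over a roughly California-sized extent;
# the middle two are nearly identical, so their offset is tiny.
points = [(-124.4, 32.5), (-120.0, 37.25), (-119.9995, 37.2504), (-114.1, 42.0)]
grid, scale, translate = quantize(points)
deltas = delta_encode(grid)
print(grid)
print(deltas)
print("degrees per grid step:", scale)
```

With 10,000 steps across the extent, a 5000-pixel-wide map covering that whole extent would get two grid steps per pixel, which may be where a rule of thumb like the one above comes from.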

But topojson doesn’t do any simplification by default. What if we add the -s flag to simplify? Here’s the file size for variants of the -s parameter (in steradians):

-s (steradians)  file size
0.0001           1.0 MB
0.00001          1.0 MB
0.000001         1.0 MB
0.0000001        1.1 MB
0.00000001       1.4 MB
0.000000001      2.2 MB
0.0000000001     5.8 MB
0.00000000001    5.8 MB
0.000000000001   5.8 MB

To my eye, the 1.4 MB 0.00000001 version looks about right, at least on my 800×800 map. I’m not surprised that simplifying to a gross accuracy like 0.001 doesn’t shrink the file any further; the objects themselves are pretty small. I’m not sure why it tops out at 5.8 MB, about 60% of the 9.3 MB unsimplified file; maybe the source data has some really silly little details?
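Steradians are hard to picture, so here is a small Python conversion of those thresholds into ground area. It assumes the -s value is an area threshold measured on a unit sphere (my understanding of how the simplification works, not something I verified in the source) and uses a mean Earth radius of about 6371 km.

```python
EARTH_RADIUS_KM = 6371.0  # approximate mean radius; an assumption of this sketch


def steradians_to_km2(steradians):
    """Area on the Earth's surface corresponding to a solid angle in steradians."""
    return steradians * EARTH_RADIUS_KM ** 2


# The same thresholds as in the table above, from 1e-4 down to 1e-12.
thresholds = [10.0 ** -n for n in range(4, 13)]
for s in thresholds:
    print(f"{s:.0e} sr  ~ {steradians_to_km2(s):.3g} km^2")
```

By this arithmetic the 0.00000001 setting corresponds to roughly 0.4 square kilometers, and the coarsest settings correspond to thousands of square kilometers, which would be larger than many of the subwatersheds themselves.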

Querying the JSON

Quick diversion: I wanted to query my JSON object to see what properties and other stuff were inside it. Someone’s finally built a decent “grep for JSON”, jq; its “keys” function is particularly useful for exploring a JSON file. Some quick bits:

# get all HUCs:
jq '.objects.WBD_Subwatershed.geometries[].id'

# get all properties:
jq '.objects.WBD_Subwatershed.geometries[0].properties | keys'
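For anyone without jq handy, the same two lookups can be sketched in Python. The tiny topology below is a hypothetical stand-in for the converted file: the ids, names and arcs are invented placeholders, and only the nesting (objects, then WBD_Subwatershed, then geometries) mirrors the paths used in the jq queries.

```python
import json

# Hypothetical, heavily trimmed stand-in for the converted file. The ids and
# names are invented placeholders, not real NHDPlus records.
topology_text = """
{
  "type": "Topology",
  "objects": {
    "WBD_Subwatershed": {
      "type": "GeometryCollection",
      "geometries": [
        {"type": "Polygon", "id": "180000000001", "arcs": [[0]],
         "properties": {"HU_12_NAME": "Upper Example Creek",
                        "HU_10_NAME": "Example Creek"}},
        {"type": "Polygon", "id": "180000000002", "arcs": [[1]],
         "properties": {"HU_12_NAME": "Lower Example Creek",
                        "HU_10_NAME": "Example Creek"}}
      ]
    }
  },
  "arcs": [
    [[0, 0], [1, 0], [0, 1], [-1, -1]],
    [[2, 0], [1, 0], [0, 1], [-1, -1]]
  ]
}
"""

topology = json.loads(topology_text)
geometries = topology["objects"]["WBD_Subwatershed"]["geometries"]

# Equivalent of: jq '.objects.WBD_Subwatershed.geometries[].id'
huc_ids = [geometry["id"] for geometry in geometries]

# Equivalent of: jq '.objects.WBD_Subwatershed.geometries[0].properties | keys'
# (jq's keys returns the names sorted, hence sorted() here).
property_names = sorted(geometries[0]["properties"])

print(huc_ids)
print(property_names)
```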

Big data

I foolishly spent half my afternoon trying to convert that 1.7 GB file for the whole country. TopoJSON is memory-hungry; the optimization it’s doing is global, so it really wants to read everything into RAM. Long story short, the main thing I learned is that, just like a Java VM, the V8 VM has a fixed memory pool that’s something like 1 GB by default. The magic flag to node is “--max-old-space-size”. When I set it to 5000 (MB), the whole process seemed to want 10 GB and made my 8GB machine thrash. Unfortunately, smaller numbers like 4000 or 3500 resulted in an out-of-memory error. Mike guesses I’d need 16 gigs of RAM to do the conversion. It’s all sort of silly anyway; I’m not sure what I’d do with the giant resulting file.

On the way I learned a lot of ugly things about Mac memory management:

  • Activity Monitor reports an “inactive memory” number that’s complete nonsense. It’s supposed to be RAM available for other processes, but it sat at 3GB while my user process was thrashing the swap. So either the VM is not letting me use that memory, or else it’s not really inactive after all. The “purge” command doesn’t really help here, btw.
  • The page ins / page outs numbers on Activity Monitor are hugely helpful. When it’s paging in at 30 MB/s (hard drive rate) you know something is bad.
  • The %CPU number is also useful. While thrashing, my topojson process dropped to 25% CPU (from 100%).
  • If you double click on a process you get some detailed data. The “memory” page is confusing; I kept ignoring the “virtual memory size” of the process when that turns out to be the most relevant number. That was the 10 GB that resulted in thrashing. The Real Memory Size never got over 6 GB despite having 8 GB of RAM; guess MacOS overhead is 2GB now.
  • The geniuses at Apple renamed “vmstat” to “vm_stat” and made it work differently. But it’s somewhat adequate.

Drawing Maps

Haven’t gotten very far with actually making maps yet. I followed Mike’s Let’s Make a Map tutorial and got a maplike something up pretty quickly: counties in red, watersheds in black outline. My main goal is to add some color interaction around the hierarchical HUC coding.
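The hierarchical part is what makes HUCs handy for coloring: each 12-digit code contains its parents as prefixes (2 digits for the region, then 4, 6, 8, 10 and 12). A minimal Python sketch of splitting a code into its levels follows. The function name and the sample codes are made up for illustration, and grouping by the 8-digit prefix is just one possible choice for assigning colors.

```python
from collections import defaultdict

# Prefix length for each level of the hydrologic unit hierarchy.
HUC_LEVELS = {
    2: "region",
    4: "subregion",
    6: "basin",
    8: "subbasin",
    10: "watershed",
    12: "subwatershed",
}


def huc_ancestors(huc12):
    """Map each hierarchy level to the matching prefix of a 12-digit HUC."""
    if len(huc12) != 12 or not huc12.isdigit():
        raise ValueError(f"expected a 12-digit HUC, got {huc12!r}")
    return {name: huc12[:digits] for digits, name in HUC_LEVELS.items()}


# Made-up codes in region 18, used only to show the grouping.
sample_hucs = ["180500020101", "180500020102", "180500030201"]

# Group subwatersheds by their 8-digit parent, e.g. to give each group one color.
by_subbasin = defaultdict(list)
for huc in sample_hucs:
    by_subbasin[huc_ancestors(huc)["subbasin"]].append(huc)

print(huc_ancestors(sample_hucs[0]))
print(dict(by_subbasin))
```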

[Screenshot of the map, taken 2013-02-17 at 6:19 PM]