Vector tile page sizes

I instrumented the Leaflet version of my vector tile client to show me the size of the vector tiles for the page. Chrome developer tools will do this measurement, but it’s easier to have it in console.log. All measurements are uncompressed sizes and uncached load times, for my relatively limited set of rivers (filtered by Strahler number). I intend to add more data to the vector tiles; this is a baseline measurement.

All measurements taken with a browser window about 1000 x 1000, loading about 25 tiles for the view. I slowly zoomed in on St. Louis, a fairly dense area to load.

  Size    Time   View                 Cutoff
1113kb  1237ms   4/38.89/-98.75       strahler >= 6
1117kb  1494ms   5/37.892/-94.241
1741kb  1798ms   6/38.100/-92.230     strahler >= 5
 830kb  1047ms   7/38.376/-91.214
 334kb   479ms   8/38.565/-90.722
 283kb   427ms   9/38.6340/-90.4738   strahler >= 4
 239kb   332ms  10/38.7701/-90.3962   strahler >= 3
  90kb   233ms  11/38.8445/-90.5332
  60kb   301ms  12/38.8691/-90.5842   strahler >= 2
  53kb   686ms  13/38.8697/-90.6169   strahler >= 1

I guess I knew that the map got lighter weight as I zoomed in, but not how much. Too bad; I’m most excited about loading data in the zoomed-out views. But looking at this I can move the cutoffs some and show more rivers. Update: I’ve since adjusted some of the thresholds between z=8 and z=13.

A large tile like 4/3/5.json gzips from 384k to 92k. A smaller tile like 8/40/97.json gzips from 29k to 7k. So let’s say GeoJSON gzips to about 25% of its original size; that means the biggest views will take about 500k of data on the wire to download. That’s pretty reasonable.

A whole separate question is render time. The Leaflet render is really slow and I don’t know why. The new D3 client is so fast it gives me something to aspire to.

 

TopoJSON for rivers

Mike Bostock took a crack at using TopoJSON to encode the NHDFlowline dataset: just the geometry for rivers in 2 dimensions, no properties. This test covers just California. All sizes are in one-million-byte megabytes.

  • Source shapefile: 132M, 72M gzipped.
  • Naive GeoJSON conversion: 184M, 56M gzipped.
    ogr2ogr -dim 2 -f GeoJSON -select '' new.geojson NHDFlowline.shp
  • GeoJSON rounded to 5 digits: 95M, 21M gzipped.
    liljson -p 5 < new.geojson
  • GeoJSON rounded to 3 digits: 80M, 9M gzipped.
    liljson -p 3 < new.geojson
  • TopoJSON: 20M, 3.3M gzipped.

So for this data TopoJSON is about 1/4 to 1/5 the size of the equivalent GeoJSON, and that advantage persists through gzip. And that’s for a pathologically bad case for TopoJSON: rivers are linestrings, so there’s none of the shared topology along polygon boundaries where the encoding usually wins. Pretty much all the savings here must be coming from the delta encoding. Neat!

Update: Mike converted the whole US. 2.5G of .shp file input, 327M of topojson output.

Note that TopoJSON quantizes coordinates. Mike used the default TopoJSON settings, which I think work out to about 10,000 x 10,000 resolution; that makes the comparison to the GeoJSON rounded to 3 digits about fair. Here’s a snapshot of a render of the TopoJSON that Mike gave me. It looks right.

[Screenshot: render of the California rivers TopoJSON]

Moar detail

I love this one screenshot from when I first got nice rendering implemented, showing all sorts of detailed rivers. I’ve backed way off from that because it’s so much data. But in filtering so aggressively I’ve lost something.

Mike Bostock did a pure-D3 vector tile renderer, so I bashed up a river renderer. And it’s fast. So I made some screenshots where I don’t filter the data on Strahler number at all; all rivers, all the time!

The server generation time is acceptable once you rely on caching. Data sizes are pretty bad though, 27 megs a tile :-P

 7/25/49:   1684 ms    640 kb   1928 features
 6/13/23:   6243 ms   2308 kb   5405 features
  5/6/11:  21221 ms   8841 kb  24139 features
  5/7/12:  37041 ms  15146 kb  42131 features
   4/3/6:  64572 ms  26870 kb  75744 features

 


Yet more merge strategy timings

To complement my previous timings, some more measurements of new merge strategies.

Merge by (huc8, gnis_id, strahler)
 7/25/49:    275 ms     30 kb     50 features
 6/13/23:    453 ms     95 kb    148 features
  5/6/11:    783 ms    209 kb    238 features
  5/7/12:   1048 ms    202 kb    434 features
   4/3/6:   1333 ms    376 kb    783 features

Merge by (gnis_id, strahler) if gnis_id is not null,
otherwise merge by (huc8, strahler)
 7/25/49:    103 ms     29 kb     45 features
 6/13/23:    262 ms     91 kb    131 features
  5/6/11:    543 ms    199 kb    178 features
  5/7/12:    534 ms    186 kb    339 features
   4/3/6:    958 ms    346 kb    600 features

That second strategy seems like the right compromise. The performance numbers are pretty good, the map is accurate, and the actual SQL queries are reasonably simple. It takes 11 minutes for me to merge all 2.7M rivers this way. The resulting table has 332,615 rows in it: 322,621 rows with a gnis_id of some sort and 9,994 without.
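
Concretely, that second strategy works out to two inserts along these lines. This is a sketch rather than the literal production SQL (the table and column names follow the queries elsewhere on this page), and as described below the named half actually gets issued one gnis_id at a time:

-- Named flows merge on (gnis_id, strahler).
insert into m(gnis_id, name, strahler, huc8, geometry)
select
  gnis_id,
  MAX(name) as name,
  strahler,
  MIN(huc8) as huc8,
  ST_LineMerge(ST_Union(geometry)) as geometry
from rivers
where gnis_id is not null
group by gnis_id, strahler;

-- Unnamed flows fall back to merging on (huc8, strahler).
insert into m(gnis_id, name, strahler, huc8, geometry)
select
  MAX(gnis_id) as gnis_id,
  MAX(name) as name,
  strahler,
  huc8,
  ST_LineMerge(ST_Union(geometry)) as geometry
from rivers
where gnis_id is null
group by huc8, strahler;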

I sure wish Postgres could do this merging in a single query. The query planner seems to be doing the wrong thing, and Postgres infamously has no syntax for giving it hints. So I issue 170,000 queries, one per gnis_id, instead of doing a single insert. Here’s the query that doesn’t run in a reasonable time, along with its EXPLAIN output. I have indices on both gnis_id and strahler, but maybe the group by on the pair confuses the planner.

insert into m(gnis_id, name, strahler, huc8, geometry)
       select
         MAX(gnis_id) as gnis_id,
         MAX(name) as name,
         MAX(strahler) as strahler,
         MIN(huc8) as huc8,
         ST_LineMerge(ST_Union(geometry)) as geometry
       from rivers
       group by (gnis_id, strahler);

                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Insert on m  (cost=2863631.25..2944365.87 rows=124821 width=102)
   ->  Subquery Scan on "*SELECT*"  (cost=2863631.25..2944365.87 rows=124821 width=102)
         ->  GroupAggregate  (cost=2863631.25..2943117.66 rows=124821 width=491)
               ->  Sort  (cost=2863631.25..2870305.66 rows=2669766 width=491)
                     Sort Key: (ROW(rivers.gnis_id, rivers.strahler))
                     ->  Seq Scan on rivers  (cost=0.00..206104.66 rows=2669766 width=491)
(6 rows)
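
For reference, each of those per-gnis_id queries looks roughly like this (a sketch; the %s placeholder is bound by the script that drives the 170,000 inserts):

-- Restricting to a single gnis_id keeps each group tiny and should let
-- the gnis_id index be used.
insert into m(gnis_id, name, strahler, huc8, geometry)
select
  gnis_id,
  MAX(name) as name,
  strahler,
  MIN(huc8) as huc8,
  ST_LineMerge(ST_Union(geometry)) as geometry
from rivers
where gnis_id = %s
group by gnis_id, strahler;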

Notes on NHDPlus contents, grouping criteria

Basic stats about the NHD tables I’m working with:

NHDFlowline

  • 2,993,526 rows
  • 2,993,525 distinct ComIDs
    (the duplicate is ComID 17029298, Squaw Creek, and it’s not quite a duplicate row.)
  • 2,527,374 distinct ReachCodes
  • 181,685 distinct GNIS_IDs
  • 70,822 distinct GNIS_names
  • Almost all the flowlines are < 5km in length (bucketed with a query like the sketch after this list):
    737,072: 0-1km
    767,064: 1-2km
    696,161: 2-3km
    366,717: 3-4km
    184,901: 4-5km
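
Those buckets come from a query along these lines; a sketch, assuming NHDFlowline’s lengthkm attribute was loaded along with the geometry into a table called nhdflowline:

-- 1 km length buckets; the table and column names here are assumptions.
select floor(lengthkm) as km, count(*) as flowlines
from nhdflowline
where lengthkm < 5
group by floor(lengthkm)
order by km;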

PlusFlowlineVAA

Pages 101 and following of the NHDPlusV2 User Guide are very useful in understanding how the PlusFlowlineVAA data is calculated.

  • 2,691,346 rows (one for each flowdir == “with digitized”, according to docs)
  • 2,691,345 distinct ComIDs (17029298 is duplicated again)
  • 2,691,346 HydroSeqs. This number acts like a flowline ID for a lot of the VAA data, with the added advantage that the numbers are sorted upstream to downstream.
  • 2,626,049 distinct FromNodes
  • 1,686,246 distinct ToNodes
  • 1,013,062 distinct LevelPathIDs
  • 14,676 distinct TerminalPathIDs. None are null.
  • There are 663,408 rows with a null gnis_id and a terminalpathid of 350002977. I’m guessing these are poorly labelled tributaries of the Mississippi, but I’m not sure. It’s too many rows to merge!

Reachcodes

The reach code is an alternate ID for a flow that hierarchically encodes its watershed. For instance, the South Yuba River has reach codes like 18020125000030, 18020125000031, etc. The first 8 characters of a reach code are separable and are either identical to or closely related to the HUC8 codes that the WBD datasets use. Grouping flows by HUC8 makes a lot of sense: they are geographically clustered and related. Note that reachcode shows up in both nhdflowline and plusflowlinevaa; I’m not sure whether the two are identical.

  • There are 2,239,489 distinct reachcodes in PlusFlowlineVAA. The biggest, 10150005000096, matches 100 rows.
  • There are 2,117 distinct HUC8s in the reachcodes. The biggest HUC8, 09030001, has 5,784 rows in it (counted with a query like the sketch after this list).
  • There are 32,439 distinct 12-character prefixes for reachcodes. The biggest one, 020801110150, has 457 rows. This seems like the right size of blob for me to merge geometry on. I’m not positive it’s really correct to group by this 12-character prefix; while you see HUC12 hierarchical data in watershed boundaries, I’m not certain these reachcodes are doing the same thing. OTOH our grouping criterion can be a bit arbitrary; all we really need is some geographic locality.
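
The HUC8 and prefix counts above come from pulling substrings off the reachcode, roughly like this (a sketch; plusflowlinevaa and reachcode are the table and column names assumed here):

-- Flows per HUC8 prefix of the reachcode; the same query with
-- "from 1 for 12" gives the 12-character grouping.
select substring(reachcode from 1 for 8) as huc8, count(*) as flows
from plusflowlinevaa
group by substring(reachcode from 1 for 8)
order by flows desc
limit 5;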

Grouping criteria

So grouping by gnis_id alone isn’t sufficient, because so many flows have a null gnis_id. How about grouping by some combination of a substring of reachcode, gnis_id, and Strahler number? Here are the sizes of various grouping options (the counting query is sketched after the list):

  • 322,625 groups, biggest is 1153 rows
    where gnis_id is not null group by (gnis_id, strahler)
    this is the flawed grouping I was using. It leaves out 1.5M flows where gnis_id is null
  • 333,405 groups, biggest is 2882 rows
    group by (substring(reachcode from 1 for 8),gnis_id,strahler)
    Grouping by HUC8, gnis_id, and Strahler number. Most “correct”?
  • 454,018 groups, biggest is 391 rows
    group by (substring(reachcode from 1 for 12),gnis_id,strahler)
    Grouping by the first 12 characters of reachcode, gnis_id, and strahler.

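Counting those groups looks something like this for the HUC8 option (a sketch; the other options just swap out the group by expression, and it assumes the working rivers table carries reachcode alongside gnis_id and strahler):

-- Number of groups and size of the biggest group for (huc8, gnis_id, strahler).
select count(*) as n_groups, max(n) as biggest
from (
  select count(*) as n
  from rivers
  group by substring(reachcode from 1 for 8), gnis_id, strahler
) sizes;
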
Based on that analysis, the HUC8 grouping is the right choice. There’s a reasonable number of groups, no group is too enormous, and HUC8 is a meaningful thing to group on.

So now for every single HUC8, I’m executing this insert:

insert into merged_rivers(gnis_id, name, strahler, huc8, geometry)
select
  MAX(gnis_id) as gnis_id,
  MAX(name) as name,
  MAX(strahler),
  MAX(huc8) as huc8,
  ST_LineMerge(ST_Union(geometry)) as geometry
from rivers
where huc8 = %s
group by (gnis_id,strahler)

I’d rather do this as one big create table with a group by (huc8, gnis_id, strahler), but last time I tried something like that Postgres was still choking on it 18 hours later. Doing it per HUC8, I can at least watch the progress as it goes.
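
For the record, the single-statement version I’d prefer looks something like this; a sketch of roughly the shape of the query that was still choking after 18 hours, which is why the per-HUC8 loop above is what actually runs:

-- One big merge over the whole rivers table, grouped by (huc8, gnis_id, strahler).
create table merged_rivers as
select
  gnis_id,
  MAX(name) as name,
  strahler,
  huc8,
  ST_LineMerge(ST_Union(geometry)) as geometry
from rivers
group by huc8, gnis_id, strahler;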