Did a bit of a goof project, mashing up a bunch of vector tile layers with crazy colors. It looks good! Also gave me a chance to rework Ziggy Jonsson’s D3+Leaflet tile renderer; TileLayer.d3_geoJson is promising. It’s certaintly a heck of a lot faster than Leaflet’s GeoJSON rendering, although there are some issues with the APIs.
Notes from developing the Vector River Map.
Vector tile page sizes
I instrumented the Leaflet version of my vector tile client to show me the size of the vector tiles for the page. Chrome developer tools will do this measurement, but it’s easier to have it in console.log. All measures are uncompressed and uncached times, and for my relatively limited set of rivers (based on Strahler filtering). I intend to add more data to the vector tiles; measuring this as a baseline.
All measurements taken with a browser window about 1000 x 1000, loading about 25 tiles for the view. I slowly zoomed in on St. Louis, a fairly dense area to load.
Size Time View Cutoff 1113kB 1237ms 4/38.89/-98.75 strahler >= 6 1117kb 1494ms 5/37.892/-94.241 1741kb 1798ms 6/38.100/-92.230 strahler >= 5 830kb 1047ms 7/38.376/-91.214 334kb 479ms 8/38.565/-90.722 283kb 427ms 9/38.6340/-90.4738 strahler >= 4 239kb 332ms 10/38.7701/-90.3962 strahler >= 3 90kb 233ms 11/38.8445/-90.5332 60kb 301ms 12/38.8691/-90.5842 strahler >= 2 53kb 686ms 13/38.8697/-90.6169 strahler >= 1
I guess I knew that the map got lighter weight as I zoomed in, but not how much. Too bad; I’m most excited to load data in the zoomed out views. But looking at this I can move the cutoffs some and show more rivers. Update: I updated some of the thresholds around z=8 to z=13.
A large tile like 4/3/5.json gzips from 384k to 92k. A smaller tile like 8/40/97.json gzips from 29k to 7k. So let’s say GeoJSON gzips to 25% the original size; that means the biggest views will take about 500k of data on the wire to download. That’s pretty reasonable.
A whole separate question is render time. The Leaflet render is really bad and I don’t know why. The new D3 client is so fast it gives me some aspiration.
TopoJSON for rivers
Mike Bostock took a crack at using TopoJSON to encode the NHDFlowline dataset. Just the geometry for rivers in 2 dimensions; no properties, etc. Tested just for California. All sizes are one million byte megabytes.
- Source shapefile: 132M, 72M gzipped.
- Naive GeoJSON conversion: 184M, 56M gzipped.
ogr2ogr -dim 2 -f GeoJSON -select ” new.geojson NHDFlowline.shp
- GeoJSON rounded to 5 digits: 95M, 21M gzipped.
liljson -p 5 < new.geojson
- GeoJSON rounded to 3 digits: 80M, 9M gzipped.
liljson -p 3 < new.geojson
- TopoJSON: 20M, 3.3M gzipped.
So for this data TopoJSON is about 1/4 – 1/5 the size of the equivalent GeoJSON. And those advantages persist through gzip. And that’s for a pathologically bad case, where there’s no shared topology along polygon boundaries. Pretty much all the savings here must be coming from the delta encoding. Neat!
Update: Mike converted the whole US. 2.5G of .shp file input, 327M of topojson output.
Note that TopoJSON quantizes. Mike used the default TopoJSON settings which I think work out to about 10,000 x 10,000 resolution, which makes the comparison to GeoJSON rounding to 3 digits about fair. Here’s a snapshot of a render of the TopoJSON that Mike gave me. It looks right.
I love this one screenshot from when I first got nice rendering implemented, showing all sorts of detailed rivers. I’ve backed way off that because it’s so much data, etc. But in filtering so much I’ve lost something.
Mike Bostock did a pure-D3 vector tile renderer, so I bashed up a river renderer. And it’s fast. So I made some screenshots where I don’t filter the data on Strahler number at all; all rivers, all the time!
The server generation time is acceptable once you rely on caching. Data sizes are pretty bad though, 27 megs a tile :-P
7/25/49: 1684 ms 640 kb 1928 features 6/13/23: 6243 ms 2308 kb 5405 features 5/6/11: 21221 ms 8841 kb 24139 features 5/7/12: 37041 ms 15146 kb 42131 features 4/3/6: 64572 ms 26870 kb 75744 features
Yet more merge strategy timings
To complement my previous timings, some more measurements of new merge strategies.
Merge by (huc8, gnis_id, strahler) 7/25/49: 275 ms 30 kb 50 features 6/13/23: 453 ms 95 kb 148 features 5/6/11: 783 ms 209 kb 238 features 5/7/12: 1048 ms 202 kb 434 features 4/3/6: 1333 ms 376 kb 783 features Merge by (gnis_id, strahler) if gnis_id is not null, otherwise merge by (huc8, strahler) 7/25/49: 103 ms 29 kb 45 features 6/13/23: 262 ms 91 kb 131 features 5/6/11: 543 ms 199 kb 178 features 5/7/12: 534 ms 186 kb 339 features 4/3/6: 958 ms 346 kb 600 features
That second strategy seems like the right compromise. The performance numbers are pretty good, the map is accurate, and the actual SQL queries are reasonably simple. It takes 11 minutes for me to merge all 2.7M rivers this way. The resulting table has 332,615 rows in it; that’s 322,621 rows with a gnis_id of some sort and 9994 without.
I sure wish Postgres could do this merging with a single query. The query planner seems to be doing the wrong thing and Postgres infamously has no syntax for giving hints to it. So I issue 170,000 queries, one per gnis_id, instead of doing a single insert. Here’s the query that doesn’t run in a reasonable time, combined with the explain for it. I have indices on gnis_id and strahler both, but maybe the group by on the pair confuses it.
insert into m(gnis_id, name, strahler, huc8, geometry) select MAX(gnis_id) as gnis_id, MAX(name) as name, MAX(strahler) as strahler, MIN(huc8) as huc8, ST_LineMerge(ST_Union(geometry)) as geometry from rivers group by (gnis_id, strahler); QUERY PLAN ------------------------------------------------------------------------------------------- Insert on m (cost=2863631.25..2944365.87 rows=124821 width=102) -> Subquery Scan on "*SELECT*" (cost=2863631.25..2944365.87 rows=124821 width=102) -> GroupAggregate (cost=2863631.25..2943117.66 rows=124821 width=491) -> Sort (cost=2863631.25..2870305.66 rows=2669766 width=491) Sort Key: (ROW(rivers.gnis_id, rivers.strahler)) -> Seq Scan on rivers (cost=0.00..206104.66 rows=2669766 width=491) (6 rows)
River merges, visualized
Notes on NHDPlus contents, grouping criteria
Basic stats about the tables from NHD I’m working with
- 2,993,526 rows
- 2,993,525 distinct ComIDs
(the duplicate is ComID 17029298, Squaw Creek, and it’s not quite a duplicate row.)
- 2,527,374 distinct ReachCodes
- 181,685 distinct GNIS_IDs
- 70,822 distinct GNIS_names
- Almost all the flowlines are < 5km in length
Pages 101 and following of the NHDPlusV2 User Guide are very useful in understanding how the PlusFlowlineVAA data is calculated.
- 2,691,346 rows (one for each flowdir == “with digitized”, according to docs)
- 2,691,345 distinct ComIDs (17029298 is duplicated again)
- 2,691,346 HydroSeqs. This number acts like a flowline ID for a lot of the VAA data, with the added advantage that the numbers are sorted upstream to downstream.
- 2,626,049 distinct FromNodes
- 1,686,246 distinct ToNodes
- 1,013,062 distinct LevelPathIDs
- 14,676 distinct TerminalPathIDs. None are null.
- There are 663,408 rows with gnis_id of null and a terminalpathid of 350002977. I’m guessing these are poorly labelled tributaries for the Mississippi, but I’m not sure. It’s too many to merge!
The reach code is an alternate ID for a flow that hierarchically encodes its watershed. For instance the South Yuba River has reach codes like 18020125000030, 18020125000031, etc. The first 8 characters of a reach code are separable and are either identical to or closely related to the HUC8 that the WBD datasets use. Grouping flows by HUC8 makes a lot of sense, they are geographically clustered and related. Note that reachcode shows up both in nhdflowline and plusflowlinevaa; I’m not sure if they are identical or not.
- There are 2,239,489 distinct reachcodes in PlusFlowlinesVAA. The biggest is 10150005000096 which matches 100 rows.
- There are 2117 distinct HUC8s in the reachcodes. The biggest HUC8 is 09030001 which has 5784 rows in it.
- There are 32,439 distinct 12 character prefixes for reachcodes. The biggest one 020801110150 has 457 rows. This seems like the right size of blob for me to merge geometry on. I’m not positive it’s really correct to group by this 12 character prefix; while you see HUC12 hierarchical data in watershed boundaries, I’m not certain these reachcodes are doing the same thing. OTOH our grouping criterion can be a bit arbitrary; all we really need is some geographic locality.
So grouping by gnis_id isn’t sufficient because so many flows have that as null. How about grouping by some combination of a substring of reachcode, gnis_id, and strahler number? Here’s the sizes of various grouping options:
- 322,625 groups, biggest is 1153 rows
where gnis_id is not null group by (gnis_id, strahler)
this is the flawed grouping I was using. It leaves out 1.5M flows where gnis_id is null
- 333,405 groups, biggest is 2882 rows
group by (substring(reachcode from 1 for 8),gnis_id,strahler)
Grouping by HUC8, gnis_id, and Strahler number. Most “correct”?
- 454,018 groups, biggest is 391 rows
group by (substring(reachcode from 1 for 12),gnis_id,strahler)
Grouping by first 12 of reachcode, gnis_is, strahler.
Based on that analysis, the HUC8 is the right choice. There’s a reasonable number of groups, no group is too enormous, and HUC8 is a meaningful thing to group on.
So now for every single HUC8, I’m executing this insert:
insert into merged_rivers(gnis_id, name, strahler, huc8, geometry) select MAX(gnis_id) as gnis_id, MAX(name) as name, MAX(strahler), MAX(huc8) as huc8, ST_LineMerge(ST_Union(geometry)) as geometry from rivers where huc8 = %s group by (gnis_id,strahler)
I’d rather do this as one big create table with a group by (huc8, gnis_id, strahler) but last time I tried something like that Postgres was still choking on it 18 hours later. This way I can watch it as it goes.
Postgres bug report
Ugh, I was grooving along with a solution for my vector merging when suddenly a ordinary-working query threw the most bizarre error:
ERROR: could not open file “base/16384/24738_fsm”: Invalid argument
Googling around was useless, but I managed to get some help on IRC from anders and RhodiumToad on #postgres. Long story short, our guess is a call to free() is following a call to open(), and the free() is clobbering errno before Postgres checks on the errno from open(). Good ol’ C error reporting, have to love it :-(
It’s submitted as Postgres bug #8167. Lots of details at https://gist.github.com/NelsonMinar/5588719
This is the first time I’ve used dtruss in anger. Worked nicely. In my psql repl I did “select pg_backend_pid();” to find the PID, then did “dtruss -p > /tmp/out 2>&1” to capture the output.
Vector tile success!
I think I’ve finally wrangled my vector tile data problem, the one where my server was seeing rivers as lots of little tiny line segments instead of one giant linestring. It took me a long time to work this out; merging with ST_Union() at query time was too slow, I think because the subselect that TileStache issues confuses the gist. I tried creating a merged river database in one fell swoop with a select ST_Union(geometry)… group by gnis_id but that query didn’t complete after 18 hours, Postgres wasn’t using the gnis_id index correctly. I finally wrote a Python script to merge stuff one river at a time and create a new database that way and it runs in < 4 minutes. Some data:
- 2,667,758 rows in the rivers database. Each one has a simple LineString for geometry.
- 1,086,512 rows with a gnis_id. Another 1,581,246 where gnis_id is null. Those nulls seem to be causing a problem, see below. (There’s 1,899,204 gnis_id nulls in the source nhdflowline database; I must have tossed some when joining on comid to make the rivers table in the first place.)
- 322,621 rows in the merged database (“rivers2”, for this discussion). Each has a MultiLineString for the geometry, just ST_Union().
- 178,906 unique rivers in the dataset (identified by gnis_id). I don’t merge all this way in rivers2 because I want to keep them separated by Strahler number. Ie: the Missisippis has 7 rows in rivers2, the little tiny part at Strahler=2, the bigger part at Strahler=3, … all the way to Strahler=8 at the Gulf.
I later improved on things by making a rivers3, which is an ST_LineMerge(ST_Union(geometry)); that stitches together LineStrings where possible. Adds about 10% overhead to the table building, but significantly better results on serving.
One failure of this approach is gnis_id is sometimes null. Right now I’m just throwing those geometries away in the merging process and the result is broken lines, bits of rivers are missing. Visual inspection suggests its in places where gnis_id is null, and there’s a lot of those. I took a shot at a rivers4 where I just copied all those nulls in. It looks correct, which is a virtue, but we’re back to bloated data and query times. Need to dig more into NHDPlus and figure out why gnis_id is null in so many places, what to do about it.
Old way (rivers) 7/25/49: 577 ms 258 kb 1715 features 6/13/23: 2281 ms 1048 kb 6911 features 5/6/11: 5421 ms 2549 kb 16953 features 5/7/12: 6457 ms 2659 kb 17781 features 4/3/6: 10359 ms 4855 kb 32439 features New way (rivers2, with ST_Union) 7/25/49: 436 ms 58 kb 25 features 6/13/23: 874 ms 243 kb 90 features 5/6/11: 1648 ms 538 kb 81 features 5/7/12: 2236 ms 649 kb 162 features 4/3/6: 3435 ms 1175 kb 275 features Newer way (rivers3, with ST_LineMerge(ST_Union())) 7/25/49: 48 ms 9 kb 25 features 6/13/23: 86 ms 23 kb 90 features 5/6/11: 89 ms 21 kb 81 features 5/7/12: 162 ms 41 kb 162 features 4/3/6: 216 ms 53 kb 275 features Newerer way (rivers3, with nulls included) 7/25/49: 244 ms 98 kb 613 features 6/13/23: 814 ms 347 kb 2223 features 5/6/11: 2017 ms 946 kb 6188 features 5/7/12: 1648 ms 691 kb 4499 features 4/3/6: 2934 ms 1316 kb 8664 features
Huge improvements, very happy.
Debugging vector tiles performance
My vector river tiles app is unacceptably slow when drawing maps way out, like at zoom=6. I know what the problem is; there’s just too many rivers and geometry. But I’d hoped that it would magically work anyway. Some notes on where the time is going.
First, a testing setup. This is remarkably awkward; there have never been good off the shelf tools for measuring database slowness :-( After a failed attempt to make TileStache use a LoggingCursor what I came up with is:
- gunicorn configured for a single synchronous worker. It’s unrealistic, but much easier to follow.
- TileStache configured to debug level INFO. It emits nice lines like
INFO:root:TileStache.getTile() riverst/6/13/23.json via layer.render() in 2.090
- Basic client test; load a map and when it gets slow, look in the TileStache logs for specific tiles that are taking a long time.
- Reproducible client test: load a single tile
/bin/time curl -s http://localhost:8000/riverst/6/13/23.json | wc -c
Ensure that TileStache isn’t loading from cache.
- Postgres configured with log_duration=on and log_statement=all. This gives you helpful log lines like
LOG: statement: SELECT ….
LOG: duration: 621.962 ms
By syncing these all up I know that the tile z=6 13/23 is taking 2.09s for TileStache to create, 620ms of which is the database. And the GeoJSON output is a megabyte. That’s already very useful data; I had no idea TileStache itself was doing so much work.
Now I can take that query and look at it. For instance, that one tile returns 6911 rows! This is the NHD schema biting me in the ass; rivers are broken up into a lot of little points. The Cheyenne River on that tile, for instance, is returning 448 different rows with little LineStrings of 2-30 points each. Way too much detail for such a broad overview tile.
Some basic timings for the slowest query I could find at various zoom levels.
Tile Time DBtime Rows GeoJSON size 7/25/49 510ms 157ms 1715 264,410 6/13/23 2090ms 620ms 6911 1,073,684 5/ 7/12 5375ms 1598ms 17781 2,723,314 4/ 3/ 6 9461ms 2940ms 32439 4,972,224
Looking at this data, I’m thinking the bigger problem may really be the GeoJSON size. The tile generation time is awful but then again there’s only a few thousand tiles at zoom 7-0, can just cache them all. But there’s no love in sending all those tiles to a web browser client! The chopped up geometry is a problem for the GeoJSON too; each little LineString gets encoded as a separate feature with its own properties, so the string “Cheyenne River” is in the GeoJSON tile 448 times. Really need to fix that.
Here’s the query TileStache issues for 4/3/6 as well as the “explain analyze” output:
SELECT q.*, ST_AsBinary(ST_Transform(ST_Intersection(ST_Simplify(q.geometry, 9783.94), ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-12523442.71, 2504688.54), ST_MakePoint(-10018754.17, 5009377.09)), 900913)), 4326)) AS geometry FROM ( select geometry, name as n, strahler as s from rivers where strahler >= 6 ) AS q WHERE ST_IsValid(q.geometry) AND q.geometry && ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-12523442.71, 2504688.54), ST_MakePoint(-10018754.17, 5009377.09)), 900913) AND ST_Intersects(q.geometry, ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-12523442.71, 2504688.54), ST_MakePoint(-10018754.17, 5009377.09)), 900913)) Bitmap Heap Scan on rivers (cost=9362.71..41858.70 rows=943 width=478) (actual time=176.078..2661.831 rows=32439 loops=1) Recheck Cond: ((strahler >= 6) AND (geometry && '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry)) Rows Removed by Index Recheck: 65107 Filter: (st_isvalid(geometry) AND _st_intersects(geometry, '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry)) -> BitmapAnd (cost=9362.71..9362.71 rows=8487 width=0) (actual time=175.357..175.357 rows=0 loops=1) -> Bitmap Index Scan on rivers_strahler_idx (cost=0.00..1904.82 rows=102971 width=0) (actual time=10.056..10.056 rows=102825 loops=1) Index Cond: (strahler >= 6) -> Bitmap Index Scan on rivers_geometry_gist (cost=0.00..7457.17 rows=217628 width=0) (actual time=163.464..163.464 rows=754699 loops=1) Index Cond: ((geometry && '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry) AND (geometry && '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry)) Total runtime: 2666.323 ms
Update: I played around with merging geometries in the database by grouping on gnis_id. It’s got some problems, I think I broke the gist (every query is taking 35 seconds at any zoom level!) But the approach has some promise. At least it’s making for smaller outputs; 28k vs 264k for 7/25/49, 88k vs 1073k for 6/13/23.
explain analyze SELECT q.*, ST_AsBinary(ST_Transform(ST_Intersection(ST_Simplify(q.geometry, 2445.98), ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-11897270.58, 5009377.09), ST_MakePoint(-11271098.44, 5635549.22)), 900913)), 4326)) AS geometry FROM ( select ST_LineMerge(ST_Union(geometry)) as geometry, max(name) as n, max(strahler) as s from r where strahler >= 5 group by (gnis_id, strahler) ) AS q WHERE ST_IsValid(q.geometry) AND q.geometry && ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-11897270.58, 5009377.09), ST_MakePoint(-11271098.44, 5635549.22)), 900913) AND ST_Intersects(q.geometry, ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-11897270.58, 5009377.09), ST_MakePoint(-11271098.44, 5635549.22)), 900913)) Subquery Scan on q (cost=354441.41..380026.21 rows=10258 width=72) (actual time=2563.823..36912.769 rows=93 loops=1) -> GroupAggregate (cost=354441.41..377282.19 rows=10258 width=497) (actual time=2562.281..36041.899 rows=93 loops=1) Filter: (st_isvalid(st_linemerge(st_union(r.geometry))) AND (st_linemerge(st_union(r.geometry)) && '010300002031BF0D000100000005000000295C8FD236B166C15C8FC245F81B5341295C8FD236B166C1E17A144E777F5541E17A144E777F65C1E17A144E777F5541E17A144E777F65C15C8FC245F81B5341295C8FD236B166C15C8FC245F81B5341'::geometry) AND (st_linemerge(st_union(r.geometry)) && '010300002031BF0D000100000005000000295C8FD236B166C15C8FC245F81B5341295C8FD236B166C1E17A144E777F5541E17A144E777F65C1E17A144E777F5541E17A144E777F65C15C8FC245F81B5341295C8FD236B166C15C8FC245F81B5341'::geometry) AND _st_intersects(st_linemerge(st_union(r.geometry)), '010300002031BF0D000100000005000000295C8FD236B166C15C8FC245F81B5341295C8FD236B166C1E17A144E777F5541E17A144E777F65C1E17A144E777F5541E17A144E777F65C15C8FC245F81B5341295C8FD236B166C15C8FC245F81B5341'::geometry)) Rows Removed by Filter: 3599 -> Sort (cost=354441.41..354953.32 rows=204764 width=497) (actual time=2009.868..2244.670 rows=204904 loops=1) Sort Key: (ROW(r.gnis_id, r.strahler)) Sort Method: external merge Disk: 73896kB -> Bitmap Heap Scan on r (cost=3835.46..197798.53 rows=204764 width=497) (actual time=31.229..642.727 rows=204904 loops=1) Recheck Cond: (strahler >= 5) Rows Removed by Index Recheck: 1371179