I think I’ve finally wrangled my vector tile data problem, the one where my server was seeing rivers as lots of little tiny line segments instead of one giant linestring. It took me a long time to work this out; merging with ST_Union() at query time was too slow, I think because the subselect that TileStache issues confuses the gist. I tried creating a merged river database in one fell swoop with a select ST_Union(geometry)… group by gnis_id but that query didn’t complete after 18 hours, Postgres wasn’t using the gnis_id index correctly. I finally wrote a Python script to merge stuff one river at a time and create a new database that way and it runs in < 4 minutes. Some data:
- 2,667,758 rows in the rivers database. Each one has a simple LineString for geometry.
- 1,086,512 rows with a gnis_id. Another 1,581,246 where gnis_id is null. Those nulls seem to be causing a problem, see below. (There’s 1,899,204 gnis_id nulls in the source nhdflowline database; I must have tossed some when joining on comid to make the rivers table in the first place.)
- 322,621 rows in the merged database (“rivers2”, for this discussion). Each has a MultiLineString for the geometry, just ST_Union().
- 178,906 unique rivers in the dataset (identified by gnis_id). I don’t merge all this way in rivers2 because I want to keep them separated by Strahler number. Ie: the Missisippis has 7 rows in rivers2, the little tiny part at Strahler=2, the bigger part at Strahler=3, … all the way to Strahler=8 at the Gulf.
I later improved on things by making a rivers3, which is an ST_LineMerge(ST_Union(geometry)); that stitches together LineStrings where possible. Adds about 10% overhead to the table building, but significantly better results on serving.
One failure of this approach is gnis_id is sometimes null. Right now I’m just throwing those geometries away in the merging process and the result is broken lines, bits of rivers are missing. Visual inspection suggests its in places where gnis_id is null, and there’s a lot of those. I took a shot at a rivers4 where I just copied all those nulls in. It looks correct, which is a virtue, but we’re back to bloated data and query times. Need to dig more into NHDPlus and figure out why gnis_id is null in so many places, what to do about it.
Old way (rivers) 7/25/49: 577 ms 258 kb 1715 features 6/13/23: 2281 ms 1048 kb 6911 features 5/6/11: 5421 ms 2549 kb 16953 features 5/7/12: 6457 ms 2659 kb 17781 features 4/3/6: 10359 ms 4855 kb 32439 features New way (rivers2, with ST_Union) 7/25/49: 436 ms 58 kb 25 features 6/13/23: 874 ms 243 kb 90 features 5/6/11: 1648 ms 538 kb 81 features 5/7/12: 2236 ms 649 kb 162 features 4/3/6: 3435 ms 1175 kb 275 features Newer way (rivers3, with ST_LineMerge(ST_Union())) 7/25/49: 48 ms 9 kb 25 features 6/13/23: 86 ms 23 kb 90 features 5/6/11: 89 ms 21 kb 81 features 5/7/12: 162 ms 41 kb 162 features 4/3/6: 216 ms 53 kb 275 features Newerer way (rivers3, with nulls included) 7/25/49: 244 ms 98 kb 613 features 6/13/23: 814 ms 347 kb 2223 features 5/6/11: 2017 ms 946 kb 6188 features 5/7/12: 1648 ms 691 kb 4499 features 4/3/6: 2934 ms 1316 kb 8664 features
Huge improvements, very happy.