Vector tile success!

I think I’ve finally wrangled my vector tile data problem, the one where my server was seeing rivers as lots of little tiny line segments instead of one giant linestring. It took me a long time to work this out; merging with ST_Union() at query time was too slow, I think because the subselect that TileStache issues confuses the gist. I tried creating a merged river database in one fell swoop with a select ST_Union(geometry)… group by gnis_id but that query didn’t complete after 18 hours, Postgres wasn’t using the gnis_id index correctly. I finally wrote a Python script to merge stuff one river at a time and create a new database that way and it runs in < 4 minutes. Some data:

  • 2,667,758 rows in the rivers database. Each one has a simple LineString for geometry.
  • 1,086,512 rows with a gnis_id. Another 1,581,246 where gnis_id is null. Those nulls seem to be causing a problem, see below. (There’s 1,899,204 gnis_id nulls in the source nhdflowline database; I must have tossed some when joining on comid to make the rivers table in the first place.)
  • 322,621 rows in the merged database (“rivers2”, for this discussion). Each has a MultiLineString for the geometry, just ST_Union().
  • 178,906 unique rivers in the dataset (identified by gnis_id). I don’t merge all this way in rivers2 because I want to keep them separated by Strahler number. Ie: the Missisippis has 7 rows in rivers2, the little tiny part at Strahler=2, the bigger part at Strahler=3, … all the way to Strahler=8 at the Gulf.

I later improved on things by making a rivers3, which is an ST_LineMerge(ST_Union(geometry)); that stitches together LineStrings where possible. Adds about 10% overhead to the table building, but significantly better results on serving.

One failure of this approach is gnis_id is sometimes null. Right now I’m just throwing those geometries away in the merging process and the result is broken lines, bits of rivers are missing. Visual inspection suggests its in places where gnis_id is null, and there’s a lot of those. I took a shot at a rivers4 where I just copied all those nulls in. It looks correct, which is a virtue, but we’re back to bloated data and query times. Need to dig more into NHDPlus and figure out why gnis_id is null in so many places, what to do about it.

Old way (rivers)
 7/25/49:    577 ms    258 kb   1715 features
 6/13/23:   2281 ms   1048 kb   6911 features
  5/6/11:   5421 ms   2549 kb  16953 features
  5/7/12:   6457 ms   2659 kb  17781 features
   4/3/6:  10359 ms   4855 kb  32439 features

New way (rivers2, with ST_Union)
 7/25/49:    436 ms     58 kb     25 features
 6/13/23:    874 ms    243 kb     90 features
  5/6/11:   1648 ms    538 kb     81 features
  5/7/12:   2236 ms    649 kb    162 features
   4/3/6:   3435 ms   1175 kb    275 features

Newer way (rivers3, with ST_LineMerge(ST_Union()))
 7/25/49:     48 ms      9 kb     25 features
 6/13/23:     86 ms     23 kb     90 features
  5/6/11:     89 ms     21 kb     81 features
  5/7/12:    162 ms     41 kb    162 features
   4/3/6:    216 ms     53 kb    275 features

Newerer way (rivers3, with nulls included)
 7/25/49:    244 ms     98 kb    613 features
 6/13/23:    814 ms    347 kb   2223 features
  5/6/11:   2017 ms    946 kb   6188 features
  5/7/12:   1648 ms    691 kb   4499 features
   4/3/6:   2934 ms   1316 kb   8664 features

Huge improvements, very happy.