Postgres bug report

Ugh, I was grooving along with a solution for my vector merging when suddenly a ordinary-working query threw the most bizarre error:

ERROR: could not open file “base/16384/24738_fsm”: Invalid argument

Googling around was useless, but I managed to get some help on IRC from anders and RhodiumToad on #postgres. Long story short, our guess is a call to free() is following a call to open(), and the free() is clobbering errno before Postgres checks on the errno from open(). Good ol’ C error reporting, have to love it :-(

It’s submitted as Postgres bug #8167. Lots of details at https://gist.github.com/NelsonMinar/5588719

This is the first time I’ve used dtruss in anger. Worked nicely. In my psql repl I did “select pg_backend_pid();” to find the PID, then did “dtruss -p > /tmp/out 2>&1” to capture the output.

Vector tile success!

I think I’ve finally wrangled my vector tile data problem, the one where my server was seeing rivers as lots of little tiny line segments instead of one giant linestring. It took me a long time to work this out; merging with ST_Union() at query time was too slow, I think because the subselect that TileStache issues confuses the gist. I tried creating a merged river database in one fell swoop with a select ST_Union(geometry)… group by gnis_id but that query didn’t complete after 18 hours, Postgres wasn’t using the gnis_id index correctly. I finally wrote a Python script to merge stuff one river at a time and create a new database that way and it runs in < 4 minutes. Some data:

  • 2,667,758 rows in the rivers database. Each one has a simple LineString for geometry.
  • 1,086,512 rows with a gnis_id. Another 1,581,246 where gnis_id is null. Those nulls seem to be causing a problem, see below. (There’s 1,899,204 gnis_id nulls in the source nhdflowline database; I must have tossed some when joining on comid to make the rivers table in the first place.)
  • 322,621 rows in the merged database (“rivers2”, for this discussion). Each has a MultiLineString for the geometry, just ST_Union().
  • 178,906 unique rivers in the dataset (identified by gnis_id). I don’t merge all this way in rivers2 because I want to keep them separated by Strahler number. Ie: the Missisippis has 7 rows in rivers2, the little tiny part at Strahler=2, the bigger part at Strahler=3, … all the way to Strahler=8 at the Gulf.

I later improved on things by making a rivers3, which is an ST_LineMerge(ST_Union(geometry)); that stitches together LineStrings where possible. Adds about 10% overhead to the table building, but significantly better results on serving.

One failure of this approach is gnis_id is sometimes null. Right now I’m just throwing those geometries away in the merging process and the result is broken lines, bits of rivers are missing. Visual inspection suggests its in places where gnis_id is null, and there’s a lot of those. I took a shot at a rivers4 where I just copied all those nulls in. It looks correct, which is a virtue, but we’re back to bloated data and query times. Need to dig more into NHDPlus and figure out why gnis_id is null in so many places, what to do about it.

Old way (rivers)
 7/25/49:    577 ms    258 kb   1715 features
 6/13/23:   2281 ms   1048 kb   6911 features
  5/6/11:   5421 ms   2549 kb  16953 features
  5/7/12:   6457 ms   2659 kb  17781 features
   4/3/6:  10359 ms   4855 kb  32439 features

New way (rivers2, with ST_Union)
 7/25/49:    436 ms     58 kb     25 features
 6/13/23:    874 ms    243 kb     90 features
  5/6/11:   1648 ms    538 kb     81 features
  5/7/12:   2236 ms    649 kb    162 features
   4/3/6:   3435 ms   1175 kb    275 features

Newer way (rivers3, with ST_LineMerge(ST_Union()))
 7/25/49:     48 ms      9 kb     25 features
 6/13/23:     86 ms     23 kb     90 features
  5/6/11:     89 ms     21 kb     81 features
  5/7/12:    162 ms     41 kb    162 features
   4/3/6:    216 ms     53 kb    275 features

Newerer way (rivers3, with nulls included)
 7/25/49:    244 ms     98 kb    613 features
 6/13/23:    814 ms    347 kb   2223 features
  5/6/11:   2017 ms    946 kb   6188 features
  5/7/12:   1648 ms    691 kb   4499 features
   4/3/6:   2934 ms   1316 kb   8664 features

Huge improvements, very happy.

Debugging vector tiles performance

My vector river tiles app is unacceptably slow when drawing maps way out, like at zoom=6. I know what the problem is; there’s just too many rivers and geometry. But I’d hoped that it would magically work anyway. Some notes on where the time is going.

First, a testing setup. This is remarkably awkward; there have never been good off the shelf tools for measuring database slowness :-( After a failed attempt to make TileStache use a LoggingCursor what I came up with is:

  • gunicorn configured for a single synchronous worker. It’s unrealistic, but much easier to follow.
  • TileStache configured to debug level INFO. It emits nice lines like
    INFO:root:TileStache.getTile() riverst/6/13/23.json via layer.render() in 2.090
  • Basic client test; load a map and when it gets slow, look in the TileStache logs for specific tiles that are taking a long time.
  • Reproducible client test: load a single tile
    /bin/time curl -s http://localhost:8000/riverst/6/13/23.json | wc -c
    Ensure that TileStache isn’t loading from cache.
  • Postgres configured with log_duration=on and log_statement=all. This gives you helpful log lines like
    LOG: statement: SELECT ….
    LOG: duration: 621.962 ms

By syncing these all up I know that the tile z=6 13/23 is taking 2.09s for TileStache to create, 620ms of which is the database. And the GeoJSON output is a megabyte. That’s already very useful data; I had no idea TileStache itself was doing so much work.

Now I can take that query and look at it. For instance, that one tile returns 6911 rows! This is the NHD schema biting me in the ass; rivers are broken up into a lot of little points. The Cheyenne River on that tile, for instance, is returning 448 different rows with little LineStrings of 2-30 points each. Way too much detail for such a broad overview tile.

Some basic timings for the slowest query I could find at various zoom levels.

Tile     Time    DBtime   Rows    GeoJSON size
7/25/49   510ms   157ms    1715     264,410
6/13/23  2090ms   620ms    6911   1,073,684
5/ 7/12  5375ms  1598ms   17781   2,723,314
4/ 3/ 6  9461ms  2940ms   32439   4,972,224

Looking at this data, I’m thinking the bigger problem may really be the GeoJSON size. The tile generation time is awful but then again there’s only a few thousand tiles at zoom 7-0, can just cache them all. But there’s no love in sending all those tiles to a web browser client! The chopped up geometry is a problem for the GeoJSON too; each little LineString gets encoded as a separate feature with its own properties, so the string “Cheyenne River” is in the GeoJSON tile 448 times. Really need to fix that.

Here’s the query TileStache issues for 4/3/6 as well as the “explain analyze” output:


SELECT q.*,
	                     ST_AsBinary(ST_Transform(ST_Intersection(ST_Simplify(q.geometry, 9783.94), ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-12523442.71, 2504688.54), ST_MakePoint(-10018754.17, 5009377.09)), 900913)), 4326)) AS geometry
	              FROM (
	                select geometry, name as n, strahler as s from rivers where strahler >= 6
	                ) AS q
	              WHERE ST_IsValid(q.geometry)
	                AND q.geometry && ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-12523442.71, 2504688.54), ST_MakePoint(-10018754.17, 5009377.09)), 900913)
	                AND ST_Intersects(q.geometry, ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-12523442.71, 2504688.54), ST_MakePoint(-10018754.17, 5009377.09)), 900913))

 Bitmap Heap Scan on rivers  (cost=9362.71..41858.70 rows=943 width=478) (actual time=176.078..2661.831 rows=32439 loops=1)
   Recheck Cond: ((strahler >= 6) AND (geometry && '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry))
   Rows Removed by Index Recheck: 65107
   Filter: (st_isvalid(geometry) AND _st_intersects(geometry, '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry))
   ->  BitmapAnd  (cost=9362.71..9362.71 rows=8487 width=0) (actual time=175.357..175.357 rows=0 loops=1)
         ->  Bitmap Index Scan on rivers_strahler_idx  (cost=0.00..1904.82 rows=102971 width=0) (actual time=10.056..10.056 rows=102825 loops=1)
               Index Cond: (strahler >= 6)
         ->  Bitmap Index Scan on rivers_geometry_gist  (cost=0.00..7457.17 rows=217628 width=0) (actual time=163.464..163.464 rows=754699 loops=1)
               Index Cond: ((geometry && '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry) AND (geometry && '010300002031BF0D000100000005000000EC51B856F6E267C152B81E45F81B4341EC51B856F6E267C15C8FC245F81B5341D7A37045F81B63C15C8FC245F81B5341D7A37045F81B63C152B81E45F81B4341EC51B856F6E267C152B81E45F81B4341'::geometry))
 Total runtime: 2666.323 ms

Update: I played around with merging geometries in the database by grouping on gnis_id. It’s got some problems, I think I broke the gist (every query is taking 35 seconds at any zoom level!) But the approach has some promise. At least it’s making for smaller outputs; 28k vs 264k for 7/25/49, 88k vs 1073k for 6/13/23.

explain analyze  SELECT q.*,
                     ST_AsBinary(ST_Transform(ST_Intersection(ST_Simplify(q.geometry, 2445.98), ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-11897270.58, 5009377.09), ST_MakePoint(-11271098.44, 5635549.22)), 900913)), 4326)) AS geometry
              FROM (
                select ST_LineMerge(ST_Union(geometry)) as geometry, max(name) as n, max(strahler) as s from r where strahler >= 5 group by (gnis_id, strahler)
                ) AS q
              WHERE ST_IsValid(q.geometry)
                AND q.geometry && ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-11897270.58, 5009377.09), ST_MakePoint(-11271098.44, 5635549.22)), 900913)
                AND ST_Intersects(q.geometry, ST_SetSRID(ST_MakeBox2D(ST_MakePoint(-11897270.58, 5009377.09), ST_MakePoint(-11271098.44, 5635549.22)), 900913))

 Subquery Scan on q  (cost=354441.41..380026.21 rows=10258 width=72) (actual time=2563.823..36912.769 rows=93 loops=1)
   ->  GroupAggregate  (cost=354441.41..377282.19 rows=10258 width=497) (actual time=2562.281..36041.899 rows=93 loops=1)
         Filter: (st_isvalid(st_linemerge(st_union(r.geometry))) AND (st_linemerge(st_union(r.geometry)) && '010300002031BF0D000100000005000000295C8FD236B166C15C8FC245F81B5341295C8FD236B166C1E17A144E777F5541E17A144E777F65C1E17A144E777F5541E17A144E777F65C15C8FC245F81B5341295C8FD236B166C15C8FC245F81B5341'::geometry) AND (st_linemerge(st_union(r.geometry)) && '010300002031BF0D000100000005000000295C8FD236B166C15C8FC245F81B5341295C8FD236B166C1E17A144E777F5541E17A144E777F65C1E17A144E777F5541E17A144E777F65C15C8FC245F81B5341295C8FD236B166C15C8FC245F81B5341'::geometry) AND _st_intersects(st_linemerge(st_union(r.geometry)), '010300002031BF0D000100000005000000295C8FD236B166C15C8FC245F81B5341295C8FD236B166C1E17A144E777F5541E17A144E777F65C1E17A144E777F5541E17A144E777F65C15C8FC245F81B5341295C8FD236B166C15C8FC245F81B5341'::geometry))
         Rows Removed by Filter: 3599
         ->  Sort  (cost=354441.41..354953.32 rows=204764 width=497) (actual time=2009.868..2244.670 rows=204904 loops=1)
               Sort Key: (ROW(r.gnis_id, r.strahler))
               Sort Method: external merge  Disk: 73896kB
               ->  Bitmap Heap Scan on r  (cost=3835.46..197798.53 rows=204764 width=497) (actual time=31.229..642.727 rows=204904 loops=1)
                     Recheck Cond: (strahler >= 5)
                     Rows Removed by Index Recheck: 1371179

TileStache: crappy CGI timings

I made the mistake of doing a bunch of testing of TileStache with the simple Python CGIHTTPServer. It’s a great little test apparatus, but it’s single threaded. Also it’s slow by itself.

  • CGIHTTPServer: 750ms to load a single tile, 10,000ms to load a whole map with 24 or so tiles
  • Gunicorn with 4 workers (sync): 650ms to load a single file, 2154ms to load a whole map.

Some comparisons on Gunicorn settings to load a map with 24 tiles.

  • 1 worker, default sync worker: 4800ms
  • 2 workers, default sync worker: 2500ms
  • 4 workers, default sync worker: 2300ms
  • 24 workers, default sync worker: 1700ms

Interestingly the gevent worker seems slower, also more variable. OTOH it’s necessary if you’re serving GUnicorn direct to the web, without a proxy in front.

The backend server is mostly waiting on IO to Postgres, it’s no surprise doing that in parallel will be better. All on a single 4 core 2012 iMac with some SSD-backed storage. I should add I’m doing this all without benefit of any caching, so it’s all kind of fake.

 

Vector tile river maps

Some notes for a quick project I did with Mike Migurski’s help, making a map of California rivers using vector tiles. The river data is from NHDPlus, about 180 megs of shapefile, imported into PostGIS. It’s served as manageably small GeoJSON tiles via a TileStache VecTiles provider which slices up the data like raster map tiles. Rendered for now in Polymaps. Works well!

TileStache is already smart about simplifying the geometries it returns to roughly pixel precision, depending on tile zoom level. It’s really pretty great. But I also want to serve fewer rivers at larger scales, only show the important rivers. Need a classifier. I don’t think the propery I need is in NHDFlowlines but may be in NHDPlusAttributes, more thoughts in this GIS tutorial. Seth F advises “PlusFlowlineVAA has value-added attributes that can be joined on ComId. streamorder is the column you’re looking for. It’s the Strahler number (http://en.wikipedia.org/wiki/Strahler_number) for a given section of river.”

I ran into some problems with queries from GDAL not working with PostGIS 2.0. Used to be you could just call functions like GetSRID() etc, now you have to do ST_GetSRID(). The old GDAL 1.7 on my Debian box doesn’t know that yet and the Vector provider in TileStache relies on it, so it doesn’t work. VecTiles generates its own queries and Mike was kind enough to patch his code to use the ST_ names.

The TileStache VecTiles code only works with a spherical mercator projection, so I had to do this in PostGIS to add a reprojected column:

select AddGeometryColumn('nhdflowline', 
  'sphgeometry', 900913, 'MULTILINESTRING', 2)

update nhdflowline
set sphgeometry = 
  ST_Transform(ST_Force_2D(geom), 900913);

Here’s the TileStache config for serving vector tiles from PostGIS:

    &quot;riverst&quot;: {
      &quot;provider&quot;: {
      &quot;class&quot;: &quot;TileStache.Goodies.VecTiles:Provider&quot;,
        &quot;kwargs&quot;: {
              &quot;dbinfo&quot;: {
                &quot;host&quot;: &quot;localhost&quot;,
                &quot;user&quot;: &quot;nelson&quot;,
                &quot;password&quot;: &quot;OMGSECRET&quot;,
                &quot;database&quot;: &quot;nhd&quot;
              },
              &quot;queries&quot;: [
                &quot;select sphgeometry as geometry from nhdflowline&quot;
              ],
              &quot;clip&quot;: true
        }}}}

Some screenshots; rivers on top of the ESRI relief layer, also the “first light” with a hysterical bug where the river paths are filled, not stroked.

Screen Shot 2013-04-12 at 2.06.57 PM

Screen Shot 2013-04-12 at 2.11.40 PM