Yet more merge strategy timings

To complement my previous timings, some more measurements of new merge strategies.

Merge by (huc8, gnis_id, strahler)
 7/25/49:    275 ms     30 kb     50 features
 6/13/23:    453 ms     95 kb    148 features
  5/6/11:    783 ms    209 kb    238 features
  5/7/12:   1048 ms    202 kb    434 features
   4/3/6:   1333 ms    376 kb    783 features

Merge by (gnis_id, strahler) if gnis_id is not null,
otherwise merge by (huc8, strahler)
 7/25/49:    103 ms     29 kb     45 features
 6/13/23:    262 ms     91 kb    131 features
  5/6/11:    543 ms    199 kb    178 features
  5/7/12:    534 ms    186 kb    339 features
   4/3/6:    958 ms    346 kb    600 features

That second strategy seems like the right compromise. The performance numbers are pretty good, the map is accurate, and the actual SQL queries are reasonably simple. It takes 11 minutes for me to merge all 2.7M rivers this way. The resulting table has 332,615 rows in it; that’s 322,621 rows with a gnis_id of some sort and 9994 without.

I sure wish Postgres could do this merging with a single query. The query planner seems to be doing the wrong thing and Postgres infamously has no syntax for giving hints to it. So I issue 170,000 queries, one per gnis_id, instead of doing a single insert. Here’s the query that doesn’t run in a reasonable time, combined with the explain for it. I have indices on gnis_id and strahler both, but maybe the group by on the pair confuses it.

insert into m(gnis_id, name, strahler, huc8, geometry)
         MAX(gnis_id) as gnis_id,
         MAX(name) as name,
         MAX(strahler) as strahler,
         MIN(huc8) as huc8,
         ST_LineMerge(ST_Union(geometry)) as geometry
       from rivers
       group by (gnis_id, strahler);

                                        QUERY PLAN
 Insert on m  (cost=2863631.25..2944365.87 rows=124821 width=102)
   ->  Subquery Scan on "*SELECT*"  (cost=2863631.25..2944365.87 rows=124821 width=102)
         ->  GroupAggregate  (cost=2863631.25..2943117.66 rows=124821 width=491)
               ->  Sort  (cost=2863631.25..2870305.66 rows=2669766 width=491)
                     Sort Key: (ROW(rivers.gnis_id, rivers.strahler))
                     ->  Seq Scan on rivers  (cost=0.00..206104.66 rows=2669766 width=491)
(6 rows)