Basic stats about the tables from NHD I’m working with
- 2,993,526 rows
- 2,993,525 distinct ComIDs
(the duplicate is ComID 17029298, Squaw Creek, and it’s not quite a duplicate row.)
- 2,527,374 distinct ReachCodes
- 181,685 distinct GNIS_IDs
- 70,822 distinct GNIS_names
- Almost all the flowlines are < 5km in length
Pages 101 and following of the NHDPlusV2 User Guide are very useful in understanding how the PlusFlowlineVAA data is calculated.
- 2,691,346 rows (one for each flowdir == “with digitized”, according to docs)
- 2,691,345 distinct ComIDs (17029298 is duplicated again)
- 2,691,346 HydroSeqs. This number acts like a flowline ID for a lot of the VAA data, with the added advantage that the numbers are sorted upstream to downstream.
- 2,626,049 distinct FromNodes
- 1,686,246 distinct ToNodes
- 1,013,062 distinct LevelPathIDs
- 14,676 distinct TerminalPathIDs. None are null.
- There are 663,408 rows with gnis_id of null and a terminalpathid of 350002977. I’m guessing these are poorly labelled tributaries for the Mississippi, but I’m not sure. It’s too many to merge!
The reach code is an alternate ID for a flow that hierarchically encodes its watershed. For instance the South Yuba River has reach codes like 18020125000030, 18020125000031, etc. The first 8 characters of a reach code are separable and are either identical to or closely related to the HUC8 that the WBD datasets use. Grouping flows by HUC8 makes a lot of sense, they are geographically clustered and related. Note that reachcode shows up both in nhdflowline and plusflowlinevaa; I’m not sure if they are identical or not.
- There are 2,239,489 distinct reachcodes in PlusFlowlinesVAA. The biggest is 10150005000096 which matches 100 rows.
- There are 2117 distinct HUC8s in the reachcodes. The biggest HUC8 is 09030001 which has 5784 rows in it.
- There are 32,439 distinct 12 character prefixes for reachcodes. The biggest one 020801110150 has 457 rows. This seems like the right size of blob for me to merge geometry on. I’m not positive it’s really correct to group by this 12 character prefix; while you see HUC12 hierarchical data in watershed boundaries, I’m not certain these reachcodes are doing the same thing. OTOH our grouping criterion can be a bit arbitrary; all we really need is some geographic locality.
So grouping by gnis_id isn’t sufficient because so many flows have that as null. How about grouping by some combination of a substring of reachcode, gnis_id, and strahler number? Here’s the sizes of various grouping options:
- 322,625 groups, biggest is 1153 rows
where gnis_id is not null group by (gnis_id, strahler)
this is the flawed grouping I was using. It leaves out 1.5M flows where gnis_id is null
- 333,405 groups, biggest is 2882 rows
group by (substring(reachcode from 1 for 8),gnis_id,strahler)
Grouping by HUC8, gnis_id, and Strahler number. Most “correct”?
- 454,018 groups, biggest is 391 rows
group by (substring(reachcode from 1 for 12),gnis_id,strahler)
Grouping by first 12 of reachcode, gnis_is, strahler.
Based on that analysis, the HUC8 is the right choice. There’s a reasonable number of groups, no group is too enormous, and HUC8 is a meaningful thing to group on.
So now for every single HUC8, I’m executing this insert:
insert into merged_rivers(gnis_id, name, strahler, huc8, geometry)
MAX(gnis_id) as gnis_id,
MAX(name) as name,
MAX(huc8) as huc8,
ST_LineMerge(ST_Union(geometry)) as geometry
where huc8 = %s
group by (gnis_id,strahler)
I’d rather do this as one big create table with a group by (huc8, gnis_id, strahler) but last time I tried something like that Postgres was still choking on it 18 hours later. This way I can watch it as it goes.