Nelson's log

Notes on NHDPlus contents, grouping criteria

Basic stats about the tables from NHD I’m working with



Pages 101 and following of the NHDPlusV2 User Guide are very useful in understanding how the PlusFlowlineVAA data is calculated.


The reach code is an alternate ID for a flow that hierarchically encodes its watershed. For instance the South Yuba River has reach codes like 18020125000030, 18020125000031, etc. The first 8 characters of a reach code are separable and are either identical to or closely related to the HUC8 that the WBD datasets use. Grouping flows by HUC8 makes a lot of sense, they are geographically clustered and related. Note that reachcode shows up both in nhdflowline and plusflowlinevaa; I’m not sure if they are identical or not.

Grouping criteria

So grouping by gnis_id isn’t sufficient because so many flows have that as null. How about grouping by some combination of a substring of reachcode, gnis_id, and strahler number? Here’s the sizes of various grouping options:

Based on that analysis, the HUC8 is the right choice. There’s a reasonable number of groups, no group is too enormous, and HUC8 is a meaningful thing to group on.

So now for every single HUC8, I’m executing this insert:

insert into merged_rivers(gnis_id, name, strahler, huc8, geometry)
  MAX(gnis_id) as gnis_id,
  MAX(name) as name,
  MAX(huc8) as huc8,
  ST_LineMerge(ST_Union(geometry)) as geometry
from rivers
where huc8 = %s
group by (gnis_id,strahler)

I’d rather do this as one big create table with a group by (huc8, gnis_id, strahler) but last time I tried something like that Postgres was still choking on it 18 hours later. This way I can watch it as it goes.