InfluxDB 1.8 schemas

I’m using Influx more and more for system monitoring. I’m setting up a new system to monitor my solar system and realized I don’t understand Influx data schema. Here’s my notes, based on the docs. This is for Influx 1.8; I think it’s broadly the same in Influx 2 but I’m not sure. Their docs are really quite good but I had to summarize it here to understand it.

Influx schema concepts

I’m going to follow those docs’ example: you have two scientists (langstroth and perpetua) conducting a census, recording observations in two locations (1 and 2). Each observation records two things: the count of butterflies and the count of honeybees. Observations are timestamped. Note the term observation is something I made up; it’s not an InfluxDB term. The question is how you store a bunch of observations in Influx.

Influx feels a little free-form and schemaless but really it’s pretty structured. Each observation is a row. Every row has a bunch of columns in it that boil down to key/value pairs. Rows can have different numbers of columns in them. Here’s all the components of an Influx schema

  • Measurements. These are more or less like tables in databases. You have one Influx measurement for each observation. In a traditional relational database you’d have a table per kind of data and a row for each observation. In our example there’s one measurement called “census” that records all the observations.
  • Timestamp. An integer, in nanoseconds. Signed 64 bit numbers from the years 1677 to 2262. In a table database you’d have a column named timestamp. It’s probably indexed.
  • Fields. These are the measured data for each observation. Fields have keys (names like “butterflies”) and values (numbers like 12). Field values can be strings, floats, integers, or booleans. In a table database you’d have one column per field with the column name being the field key and the values being the field values. There’s probably a lot of different values for field values. Field values are not indexed.
  • Tags. These are labels for the observations. Tags also have keys (names like “scientist” or “location”) and values (strings like “langstroth” or “location 1”). In a relational database you’d also have one column per tag, named by the tag key and with the tag value stored in the row. There’s probably only a few values for tag values. Tag values are indexed.
  • Series. A series is the name for all of the measurements that match the same measurement name and tag values. A series isn’t a thing you create explicitly, it’s implied by the set of values of tags. Ie: one series is “all of langstroth’s observations in location 1.” There’s no direct analog for a series in a relational database, but it’s the kind of data you’d regularly query.
    select timestamp, butterflies, honeybees from census
    where scientist = langstroth and location = 1

InfluxDB is schemaless in that you can create new series without having to define them first. Insert some new data with a new measurement name and tag values and it will make a new series. Each measurement can have its own collection of fields with values of various types.

In practical use it’s a pretty close match to how we use tables for storing observations in a relational database. The big difference is the distinction between tags and fields; some of the columns are independent variables like who took an observation and some are dependent variables like the count of butterflies.

Designing my solar schema

Going back to my solar monitoring project, the whole reason I’m looking at this is the code I found creates a whole lot of measurements when I think it should create one.

For example there’s a measurement named “sunpower_pvs_inverter_operating_frequency_hertz” and a different one named “sunpower_pvs_inverter_heatsink_temperature_celcius”. Each measurement then has a single field typically named “gauge” where the actual data is stored, a number like the temperature or frequency. That feels clumsy to me, maybe it’s a consequence of the data being cast in Prometheus format?

To me it feels more natural to have a single measurement named “sunpower_pvs_inverter” and for each observation to have two fields with keys “operating_frequency_hertz” and “sunpower_pvs_inverter”. Each field would store the data directly; no “gauge”.

I agree more with the tag design of the code I found. For instance I have 28 inverters and these are represented in the data with tags like device_id=E00100128000007. But they went overboard with tags; there’s separate tags for software version, URL the data came from, etc. This might come in handy with a complex hetereogeneous system but it’s confusing complexity for my simple system.

Here’s the schema I think I want. It kind of tracks the schema implied in Sunpower’s own JSON output from the device_list call. Just four measurements.

PVS status. This is information on the monitoring computer itself. No tags needed, some field keys:

  • SWVER dl_err_count dl_comm_err dl_skipped_scans dl_scan_time dl_untransmitted dl_uptime dl_cpu_load dl_mem_used dl_flash_avail

Production. This is measuring data from the PVS6 CT sensors on the solar power production power line and has type “PVS5-METER-P” in the JSON. I don’t think I need any tags at all for this. Some field keys I will record values for:

  • STATE p_3phsum_kw net_ltea_3phsum_kwh freq_hz s_3phsum_kva q_3phsum_kvar p_3phsum_kw

Consumption. Again CT sensors, this on the house consumption side independent of the solar. This has type “PVS5-METER-C” in the JSON. There’s two values for many things because I have a split phase system with different sensors on each side. Some interesting values:

  • STATE net_ltea_3phsum_kwh p_3phsum_kw q_3phsum_kvar s_3phsum_kva freq_hz i1_a i2_a v1n_v v2n_v v12_v p1_kw p2_kw neg_ltea_3phsum_kwh neg_ltea_3phsum_kwh

Inverters. This is the detailed data from each solar panel, there’s one JSON block for each of 18 panels. Some values:

  • STATE SERIAL ltea_3phsum_kwh p_3phsum_kw vln_3phavg_v i_3phsum_a p_mppt1_kw v_mppt1_v t_htsnk_degc freq_hz

Inverter rollup. I might want to add a synthetic fifth measurement, something that summarizes the state of all the inverters. You could do this at query-time in Grafana or whatever but it’s probably easier to do it in the Python code which is collecting the data. Probably this is what I’d actually graph; trying to draw 28 different inverters separately seems like a mess. Some possible measurements

  • Count of working inverters
  • Sum, average and S.D. of inverter kW (both AC and DC)
  • Number of inverters well below average kW
  • Sum, average, and S.D. of inverter voltage and current (both AC and DC)
  • Average and SD of inverter temperature

I think in all cases the timestamp should be when my own monitoring code ran. Each bit of JSON data in the PVS output has its own timestamp and I bet that can matter sometimes, say if an inverter isn’t communicating correctly. But I could imagine a lot going wrong too; I trust my own clock more.

A big question for me is if I should try to filter and reprocess all the data coming from the PVS6 JSON. Or just insert it more or less as-is. Filter each JSON item into either a tag (“SWVER”) or a field (“killowatts”). That means inserting a bunch of junk into Influx I will never care to query. But then again it means not losing data, either.

Data meanings

This has nothing to do with Influx but I was trying to understand what some of these data values mean.

“LTEA” seems to stand for “life time accumulator” and is one of the few (only?) bits of data that has history. Most of the rest are instantaneous values; current watts, not watt-hours.

“3phsum” seems to refer to three phase power, presumably the sum of all three phases so you have a single number. I think my inverters are simple single phase so this doesn’t really apply, but maybe they use that word for consistency with other products. The only measurements of wattage are labelled “3phsum” so I’m gonna just ignore the details.