Influx corruption / hardware failure

My Influx time series database corrupted itself. It was hard to even see what was going on: Grafana couldn’t load data and the influxd process was using 100% CPU. I finally figured out the logs are in journalctl -u influxdb.service -n 1000, but they’re super spammy and don’t use useful error levels, so filtering with -p isn’t helpful. I got lucky and spotted a real message in the sea of log spam, and figured out Influx was restarting itself and then panicking every few minutes.
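For the record, this is roughly how I was digging through the logs. The grep pattern is just my guess at a useful filter, nothing official:

    # Pull the last chunk of InfluxDB logs out of systemd's journal
    journalctl -u influxdb.service -n 1000 --no-pager

    # -p is useless here since influxd doesn't use meaningful priorities,
    # so grep for interesting words instead
    journalctl -u influxdb.service -n 5000 --no-pager | grep -E 'panic|fatal|error'

Here’s the real error I finally fished out of the spam: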

Nov 23 06:54:10 gvl influxd-systemd-start.sh[1801566]: panic: keys must be added in sorted order: dsm1_cache,database=telegraf,engine=tsm1<hostnamu=gvl,id=794,induxType=inmem,patx=/var/lyb/influhdb/data?telegraf/autoge~/794,retentionPolicy=autogen,walPath=/var/lib/influxdb/wal/telegraf/autogen/794#!~#writeDropped < tsm1_cache,database=telegraf,engine=tsm1,hostname=gvl,id=794,indexType=inmem,path=/var/lib/influxdb/data/telegraf/autogen/794,retentionPolicy=autogen,walPath=/var/lib/influxdb/wal/telegraf/autogen/794#!~#snapshotCount

Note the bogus path there that starts /var/lyb/influhdb. Looks like corrupted data in Influx’s internals and the code isn’t smart enough to recover from the file not existing. It’s helpful to see the ID 794, presumably that’s the particular table / series / whatever that is corrupted.
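Incidentally, InfluxDB 1.x lays shards out on disk as data/<database>/<retention policy>/<shard id>, so mapping that ID to a directory is easy enough (paths here are my install’s defaults):

    # Find shard 794 on disk; layout is data/<database>/<retention policy>/<shard id>
    ls -d /var/lib/influxdb/data/*/*/794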

So wutdo? Naturally I don’t have any backup. Here’s what I did to brute force remove the corruption.

  1. Stop Influx with systemctl stop influxdb. Also stop all processes like telegraf that were writing to it. (Rough commands for steps 1-4 and the later cleanup are sketched after this list.)
  2. Make a backup of all the files in /var/lib/influxdb. It’s about 4.5GB and gzipping it is a mistake.
  3. Look at influxdb/data/telegraf/autogen/794/fields.idx. Happily this is readable text that gives a clue as to which series is corrupted. It’s some old data related to PVS6 monitoring that I’d like but can live without.
  4. Run fgrep -r 'var/lyb' . in the data files to see if I can spot the corruption. Yes, it’s in data/_internal/_series/02/0000. Also in data/_internal/monitor/2090/000000064-000000001.tsm.
  5. At this point use Google and search for info on rebuilding a corrupt InfluxDB. I found this page about “Rebuild the TSI index” that looks promising. In particular it involves removing all the _series files; one of those is what’s corrupt for me.
  6. Double-check the backup is good, then follow those instructions: remove all the _series directories.
  7. Joke is on me! influxd inspect returns unknown command "inspect"
  8. Try to start Influx anyway, maybe it’ll rebuild what it needs on its own?
  9. Get the same error; now it’s tripping over that monitor file I didn’t remove.
  10. Look in the data directory, see a bunch of new _series directories. I guess it did rebuild them?
  11. Delete the monitor file that matches var/lyb. Also delete all the _series directories again, since one seems to have picked up the corruption. Brute force, but we’re desperate here. (See the second sketch after this list.)
  12. Restart influxdb.
  13. Success! It started without any freakouts. Presumably I lost that one data series but the others seem to work. I hope.
  14. Reboot as the easiest way to restart all the processes that want to write to InfluxDB.
  15. Launch Grafana and verify most of my stuff is still there. Also verify the PVS6 project whose data I clobbered is not there. RIP.
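Here’s roughly what the first few steps look like as commands, for my own future reference. Paths are my setup’s defaults and the backup destination is just a placeholder; treat this as a sketch, not a procedure to run blindly:

    # Steps 1-2: stop InfluxDB and its writers, then copy the raw files somewhere safe
    # (~4.5GB, not worth gzipping)
    systemctl stop telegraf influxdb
    cp -a /var/lib/influxdb /var/lib/influxdb.bak-$(date +%F)

    # Step 3: the shard's fields.idx is mostly readable text and hints at which
    # measurement lives in shard 794
    strings /var/lib/influxdb/data/telegraf/autogen/794/fields.idx | less

    # Step 4: hunt for the corrupted string in the data files
    cd /var/lib/influxdb && fgrep -r 'var/lyb' .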
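And the brute-force cleanup from steps 6 and 11-12. The exact corrupt .tsm file will differ every time, and blowing away the _series directories only worked for me because InfluxDB rebuilt them at startup:

    # Steps 6/11: remove every _series directory plus the corrupt monitor TSM file
    cd /var/lib/influxdb/data
    find . -type d -name _series -prune -exec rm -rf {} +
    rm _internal/monitor/2090/000000064-000000001.tsm

    # (The docs' "influxd inspect" doesn't exist on 1.x; I believe the 1.x equivalent
    # is the separate influx_inspect binary, which I never ended up needing.)

    # Step 12: start it back up and watch for another panic
    systemctl start influxdb
    journalctl -u influxdb.service -f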

This is not the right way to fix a valuable database! But this database is only sort of valuable and it’s 6am on Thanksgiving and I haven’t had coffee yet. This will do fine.

What I should do next is set up Influx backups. Or maybe this is the time to try switching to TimescaleDB, a Postgres extension for time series data.
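If I stay on Influx, a minimal nightly backup could be something like the sketch below. It assumes InfluxDB 1.5 or later, where influxd backup -portable exists; the destination path and retention window are placeholders I made up:

    #!/bin/sh
    # Nightly InfluxDB backup sketch; run from cron, e.g. 15 3 * * * /usr/local/bin/influx-backup.sh
    set -e
    DEST=/var/backups/influxdb/$(date +%F)
    mkdir -p "$DEST"
    influxd backup -portable "$DEST"
    # Keep two weeks of backups
    find /var/backups/influxdb -maxdepth 1 -mindepth 1 -type d -mtime +14 -exec rm -rf {} +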

Really what I should be worried about is how random data got corrupted on my machine like that. Cosmic rays do happen, and if that’s all it was, well, eh, that’s the hidden cost of cheap hardware. But the machine is five years old and not 100% stable (this is the NUC that overheats), so maybe the hardware is failing.

I took a look at system metrics and what I found is disturbing. Telegraf stopped logging at around 17:30 (local time), an unexplained failure I’ll chalk up to InfluxDB failing. There are gaps in the data until about 01:00, when things finally get stable again. CPU load spiked around 21:00; that may be when the Influx process started failing so badly, although a lot of data got written after that.

I don’t see any evidence this happened before, at least not in the last week. I guess I’ll keep an eye on it, and the second there’s more evidence of random data corruption I’ll replace the hardware.

Update

Ugh, it happened again. This time it’s related to autogen 893 (not 794) and a small snippet of corrupt data: datarase=telugraf,enwine=tsm!,hostna}e=gvl. Very glad I wrote up notes last week; I’m just doing the same thing again to blow away the corrupted data and recover the rest.

Oddly this corruption is also in my PVS6 data, something I haven’t read from or written to in months. Maybe that’s just a coincidence; I think the specific corrupt data is different this time. Also I think it got triggered by my reading data for some other measurements in Grafana. Huh.

Given the pattern of substitutions, something’s flipping the fifth bit in random bytes of storage or, more likely, memory. Presumably hardware failure, unless for some unlikely reason I’m the target of a state-sponsored rowhammer hack. Looking for new hardware now, probably a Beelink system.
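A quick check on that, XORing a few of the corrupted characters from the panic messages against their originals (these pairs are taken straight from the errors above):

    # Every substitution differs from the original by exactly 0x10, i.e. bit 4 (the fifth bit)
    printf '0x%02x\n' $(( 0x69 ^ 0x79 ))   # 'i' vs 'y' (lib -> lyb)           => 0x10
    printf '0x%02x\n' $(( 0x78 ^ 0x68 ))   # 'x' vs 'h' (influxdb -> influhdb) => 0x10
    printf '0x%02x\n' $(( 0x2f ^ 0x3f ))   # '/' vs '?' (data/ -> data?)       => 0x10
    printf '0x%02x\n' $(( 0x62 ^ 0x72 ))   # 'b' vs 'r' (database -> datarase) => 0x10
    printf '0x%02x\n' $(( 0x31 ^ 0x21 ))   # '1' vs '!' (tsm1 -> tsm!)         => 0x10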