IoTaWatt energy monitoring

I’ve been playing with an IoTaWatt energy monitor. This is a bit of hacker-friendly kit that measures electricity usage on each circuit in your breaker box. It gives you real time wattage data with 5 second samples and makes it much, much easier to understand where power is going in your house.

There’s a lot of products on the market now that do this kind of thing. Many are full consumer turnkey products like Neurio or Sense. They have cloud based monitors and some fancy pattern recognition to tell you things like “your refrigerator is failing”. I opted for the more hacker-friendly one, both to keep the data private and to be able to play with it more myself. See also OpenEnergyMonitor. I tell you, what I really want is something like a Leviton Smart Load Center, where the panel and breakers do this kind of monitoring themselves. (Price seems to be ~$4000 for a house, compared to $1500 for traditional dumb breakers.)

This post is long because I’m in the middle of learning about this stuff. But after running it just for 24 hours I learned some things like I have a leak in my water system somewhere, that my office uses 80W even when everything is “turned off”, and just how much electricity videogames use when you’re playing them. Having precise numbers for all this stuff is great.

How it works

The key sensor technology is the current transformer (CT), a ~$10 metal coil you clamp around the hot wire of a circuit you want to monitor. The magnetic field around the monitored wire induces current in the CT, which you can then read out the other end as either variable voltage or amperage. The common split-core CTs are installed without disturbing the wires in the breaker box at all, and nothing is electrically connected. Even so it’s important to keep the CTs plugged in to the controller box; if they aren’t they can build up a dangerous voltage.

The CTs are plugged in with 3.5mm jacks to a controller box. The IoTaWatt is an Arduino-class system that basically takes 14 CT sensors as input, stores the data on an SD card, and has a web interface over WiFi. The custom software (not Unix) includes a decent graphing interface and enough APIs to load the data up to some other system like Emoncms or your own stuff. It seems pretty good.

Installation

A temporary installation isn’t as hard as I feared. You take the panel for your breaker box off, exposing all the wires. Then you clip the CT sensors of the appropriate sizes around the wires you want to monitor. Done! You don’t even have to turn the power off first although obviously you should to be safe.

Of course it’s not so easy. The tricky part in North America is we use split phase power. We get 240V from the power company but it’s split into two out of phase 120V legs. The CTs only monitor a single leg, so if you’re measuring a 240V circuit (like a well pump or a dryer) you need to use two CTs, or pass both hot wires through a single CT, or double up the measurements. Also it’s easy to install a CT backwards, so the values are all the wrong sign. Fortunately the IoTaWatt has many software tools to compensate.
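The two software corrections mentioned above (flipping a reversed CT and doubling a single-leg reading on a 240V circuit) amount to very little arithmetic. Here is a minimal sketch of that arithmetic in Python. The function names are made up for illustration and this is not IoTaWatt's own code; doubling also assumes the two legs carry the same load, which is only true for pure 240V devices.

```python
def corrected_watts(raw_watts, reversed_ct=False, double=False):
    """Turn one raw CT reading into an estimate of the circuit's real power.

    reversed_ct: the clamp was installed backwards, so the sign is flipped.
    double: only one hot leg of a 240V circuit is clamped, so the reading
            covers half the load (assumes both legs are balanced).
    """
    watts = -raw_watts if reversed_ct else raw_watts
    if double:
        watts *= 2
    return watts


def two_ct_watts(leg_a_watts, leg_b_watts):
    """With one CT on each leg of a 240V circuit, the load is the sum."""
    return leg_a_watts + leg_b_watts


# A backwards CT on one leg of a 240V pump that reads -200W:
print(corrected_watts(-200.0, reversed_ct=True, double=True))  # prints 400.0
```

Summing two CTs is the safer option when a circuit mixes 240V and 120V loads, since it makes no assumption about balance.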

The other problem is a permanent installation. All the CTs and wires take up a lot of room in the breaker box. And the IoTaWatt also has to go somewhere, and be plugged in to two wall transformers. If you have a big breaker box you could just leave it inside (if code allows), but I’m not sure how well WiFi works there. I haven’t figured out a good permanent install.

What I measured

I just went with a simple temporary install to get some basic measurements. I only measured for a subpanel in my house and wasn’t able to get some of the more interesting things like the refrigerator. What I did get:

  • My office: a simple 110V circuit. This is where I spend all my time and includes my PC, my PS4, and my TV. Also overhead lights.
  • Furnace fan: simple 110V circuit? Gas heat, so fan only.
  • Pool: full 240V circuit. I’m only monitoring half of it, so I have to double the numbers reported. This may not be entirely accurate if the load is not balanced. But my 385W pump load shows up as 200W on the graph, so doubling is roughly correct.
  • Well pump: a 240V only circuit. I should double the numbers reported.
  • Septic pressure dose pump: another 240V only circuit, should double.
  • Subpanel supply: a full 240V circuit, with two monitors installed. I believe it’s correct to sum the two numbers when measuring the load.

Data and Graphs

Here’s some data I collected after running the monitor for 24 hours.

Power usage for individual circuits (already doubled as needed):

  • My office: 4600 Wh
  • Furnace fan: 1200 Wh
  • Pool: 3200 Wh
  • Well: 1400 Wh
  • Septic: 700 Wh
  • Total measured: 11,000 Wh
  • Subpanel supply 1: 7400 Wh
  • Subpanel supply 2: 9300 Wh
  • Total subpanel: 16,700 Wh (or about $4/day)
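The totals in this list are simple sums, and the same numbers give the average wattage per circuit and the share of the panel's load that isn't individually monitored. A small Python sketch using only the figures above (the exact measured sum is 11,100 Wh, which the list rounds to 11,000):

```python
# Daily energy per monitored circuit, in watt-hours (figures from the list above).
circuits_wh = {
    "office": 4600,
    "furnace fan": 1200,
    "pool": 3200,
    "well": 1400,
    "septic": 700,
}
# The two legs of the 240V subpanel supply.
supply_legs_wh = [7400, 9300]

measured_wh = sum(circuits_wh.values())
supply_wh = sum(supply_legs_wh)
unmeasured_wh = supply_wh - measured_wh
unmeasured_share = unmeasured_wh / supply_wh


def average_watts(wh, hours=24):
    """Average power over the period, given the energy used in it."""
    return wh / hours


print(f"measured circuits: {measured_wh} Wh")
print(f"subpanel supply:   {supply_wh} Wh")
print(f"unmonitored:       {unmeasured_wh} Wh ({unmeasured_share:.0%})")
for name, wh in circuits_wh.items():
    print(f"{name}: {average_watts(wh):.0f} W average")
```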

Office and Pool

Here’s a graph of power usage over the course of a day. The yellow line is my office, the blue line is the pool.

The pool is easy to understand. Note the graph is at half scale, I forgot to click “double” in the setup. The filter pump turns on around 2am, has a brief priming jolt, then runs steadily at ~400W until 6am. There’s a 30 minute window of very high usage, 1800W, that’s when the second pump turns on to run the sweeper. There’s also an occasional usage of power during the day, well below 100W. That’s a little pool cover pump I have that runs when there’s enough water. It was raining this day. The pool uses 3200Wh a day or $8.

My office is harder to understand. At night it’s a steady 80W; that’s with most everything turned off or in sleep mode, but I still have a bunch of network equipment and vampire loads from TV gear. The load in the morning of about 200W is when I turn on my computer and do low power stuff like reading web pages. It spikes up to 400W when I’m playing a game on my PC. The 350W load is when my PC is relatively idle but I’m playing a game on my PS4. The TV seems to also spike the whole circuit load up to 200W. There’s also some overhead lights in the mix. Putting this all together, I can estimate these loads:

  • Constant load 24/7: 80W
  • Idle PC use in morning: +120W
  • PC gaming: +300W
  • PS4 gaming: +250W
  • Watching TV: +100W
  • Full day: 4600Wh, or ~200W average.

Furnace, Well, Septic

The furnace fan (in red) uses 120W when it’s on. It doesn’t run at night. It averaged about 50W this day.

The septic pump runs a few times a day; this value should be doubled. It’s sorta correlated to water usage but not really, it’s not as simple as “every time I flush the toilet”. It’s a high load when it runs, 2000W, but only briefly. Average daily usage is 30W.

The water pattern is more troubling. (Again, values in graph should be doubled). My well pump is coming on every 30 minutes and it’s basically uncorrelated to water usage in the house. That to me says I have a leak somewhere. I turned off the outside water while monitoring this and the frequency went from ~25 minutes to ~33 minutes, so while there may also be a smaller leak outside the big loss is coming from the indoor plumbing. I’m pretty sure I don’t have any leaky faucets or running toilets. Maybe it’s the well itself and the water system’s pressure tank. Apparently a common failure mode is that the check valve leaks pressurized water back up into the well, so it has to refill itself every once in awhile. The power usage isn’t so bad (60W average, or $15 a month) but I don’t want to waste water. And it’d be better not to have the wear and tear on the pump.
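Turning an average wattage into a monthly bill is a one-line calculation. In the sketch below the electricity rate is an assumption, not a number from any bill: $0.35/kWh is picked only because it roughly reproduces the $15 a month quoted above for a 60W average. Real pricing is tiered and will differ.

```python
def monthly_kwh(avg_watts, days=30):
    """Energy used in a month by a load with the given average power."""
    return avg_watts * 24 * days / 1000


def monthly_cost(avg_watts, dollars_per_kwh, days=30):
    """Monthly cost of that load at a flat rate (tiers are ignored)."""
    return monthly_kwh(avg_watts, days) * dollars_per_kwh


ASSUMED_RATE = 0.35  # dollars per kWh; an assumption for illustration only

print(f"{monthly_kwh(60):.1f} kWh per month")
print(f"${monthly_cost(60, ASSUMED_RATE):.2f} per month")
```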

Subpanel Inputs

This is the graph of the power going into the panel. Both legs of the 240V circuit, I believe it’s correct to sum these. Not sure why they aren’t balanced more? The yellow line seems to go higher when I’ve got lights turned on and the office is busy, so maybe it’s uneven 110V loads. Not sure it really matters anyway. Note that the 5 circuits I’m monitoring in detail only add up to 11,000 Wh of the 16,700 Wh the panel supplied this day. I’ve got a bunch of other circuits I’m not monitoring, I guess they add up to about 1/3 the usage.

Conclusion

I like this IoTaWatt! Even with a temporary install I think I could learn enough in a few days to model out all the power usage of my house. Idle load, average load, and max load for every circuit in the house. That will be hugely helpful when designing a power backup for PG&E’s stupid outages, whether a generator or a battery.

What I really want is to have this kind of monitoring all the time, storing history in the background. Would love to see how my usage patterns vary over time, or spot new water leaks, or realize when the refrigerator is failing. I think the IoTaWatt is sufficient to do that, at least if you can find a nice way to install it permanently.

I do think an integrated system like the Leviton Load Center makes more sense though. I wonder if their system is in any way open? I fear the answer is no. No hint it works with any of the open source software, and Leviton doesn’t even have it working with Alexa yet. I bet there’s a way to scrape data out of the web app.

Bonus

Another graph, this of the next 24 hours. All the doubling, etc is done correctly.

Backup cell ISP notes

My Internet in Grass Valley is flaky enough that I’m looking at getting a permanent mobile service as a whole-house backup. Right now I just use my phone as a hotspot and switch my computer to it when necessary. That works OK but it only gets my own computer, I’m curious about switching the whole house. (Also dreading it; what if one of the devices decides to download a 10 GB update while on mobile?)

Ethernet / router options

The Ubiquiti EdgeMAX EdgeOS router I use supports having a second WAN connection as either load balancing or pure failover. Obviously I’d want failover. The trick is that you need to plug the mobile device into an ethernet port on the router. So the usual wireless hotspot doesn’t work.

The one device that looks exactly right is the Netgear 4G LTE Modem. $110 for a SIM card in, ethernet out. Perfect! See review here. Except it explicitly says it does not support Verizon, my carrier of choice. Some of the user comments suggest it does work on Verizon but my guess is only the LTE works and if you have to fall back to 3G it can’t do CDMA. Not ideal. I could use it with AT&T. Also it does not support LTE Advanced.

This review site’s roundup of ethernet-out devices is discouraging. Only Netgear makes the simple ethernet-out product. There’s a couple of things that are also wireless hotspots or routers that have an ethernet port, maybe they could be adapted. Or something really expensive like this new 5G hotspot. One final option would be to get a USB device and use a Raspberry Pi to output ethernet, assuming it’s possible.

Another option that looks slick is the Ubiquiti Redundant WAN over LTE. $200 for some solid-looking LTE hardware that plugs into your ethernet. Also has a setup with AT&T for service already, $15/mo plus $10/GB. The only point of confusion for me is how the failover happens. The device says it’s for UniFi networks, so I imagine it does some Ubiquiti-proprietary coordination with the router (UniFi Security Gateway) to act as a failover. If so then it probably won’t work with my old EdgeOS device.

Back to hotspots

So maybe using a hotspot is the right idea and I should just get a dedicated hotspot device and service. It’s manual failover and won’t cover my whole house, but there’s lots of product choices. The Inseego MiFi 8800L Jetpack (aka Verizon Jetpack) looks terrific. It even supports LTE Advanced! It doesn’t support Verizon 3G which is weird for a product Verizon is explicitly selling. Maybe that’s OK though, 3G is on the way out in the US.

OTOH just using my phone works well enough for the few times I need to do this. Never been sure I really needed a dedicated device. I’m using a Pixel 3 and it supports LTE Advanced.

Mobile data providers

If I get a dedicated device I need a data contract and SIM for it.

I can add a second line to my Verizon account. $10/mo and then $10/GB. I’m unclear but I think this will share my phone’s data balance? I always have way more data than I need, so that seems like a good choice.

I took a quick look at AT&T prepaid MVNOs, as something I could use with that Netgear modem. The best I could find in the US was Red Pocket Mobile. $10/mo for 1GB, and then $10/GB for topups. Or $25/mo for 5GB if you know you’re going to use it a lot. I did not look at getting a proper monthly account with AT&T, or adding a line to existing AT&T service. I suspect the pricing is the same as Verizon.
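To compare these plans for a given month's usage, including the dreaded surprise 10 GB update, here is a rough Python sketch. It makes several assumptions the pricing above doesn't settle: top-ups are billed in whole gigabytes, the 5GB plan's overage is also $10/GB, and the Verizon extra line includes no data of its own (it may actually share the phone's balance).

```python
import math


def plan_cost(gb_used, base_fee, included_gb, per_extra_gb=10.0):
    """Monthly cost in dollars: base fee plus whole-GB top-ups past the allowance."""
    extra_gb = max(0, math.ceil(gb_used - included_gb))
    return base_fee + extra_gb * per_extra_gb


# (base fee, included GB); overage assumed to be $10/GB everywhere.
PLANS = {
    "Verizon extra line": (10.0, 0),
    "Red Pocket 1GB": (10.0, 1),
    "Red Pocket 5GB": (25.0, 5),
}

for gb in (0.5, 3, 10):
    costs = ", ".join(
        f"{name} ${plan_cost(gb, fee, included):.0f}"
        for name, (fee, included) in PLANS.items()
    )
    print(f"{gb} GB: {costs}")
```

Under these assumptions the 5GB plan only wins once usage passes about 2.5 GB in a month.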

Conclusion

Three options

  1. Stick with using phone as my backup. A hassle, and manual, but it’s paid for already.
  2. Get a Jetpack mobile hotspot and switch to it manually. More convenient, but still a hassle. Small extra monthly fee.
  3. Get the Netgear LTE Modem and an AT&T MVNO. Gives me proper failover, although there’s the risk we then blow through the data thanks to some ignorant device. Have to buy a new contract, so can’t use existing data on the Verizon account.

Option 1 sure is the least demanding up-front.

Some rough watt-hour numbers

PG&E’s incompetence means I’m thinking about electricity storage recently. Some rough numbers:

Devices you charge and use:

Batteries you can charge things from:

There’s a huge variety in the stuff I lumped together as “charger”, particularly how well they furnish AC or DC for usage. Your car battery is only as good as the inverter you plug into the stupid cigarette lighter, for instance. That EGO Power Station is designed to be a full portable AC power solution that can put out 3000W of clean power (well, for 20 minutes). Briefcase solar systems can also recharge themselves from solar panels.

I currently own a Powercore and two spare UPS units, for a total of about 275Wh. With that I can charge my laptop 4 times, my iPad 8 times, or my phone 25 times. I also have the car which is good for about 3x more charging just off the battery, which doesn’t count idling the engine to keep its battery charged.

Update: I learned a couple of things in the latest outage. The car inverter works pretty well, but my 2017 A3 only puts power out the 12V sockets if the engine is running. You can leave it running unattended if you start the engine with the door open. The car manual says 120W maximum out the socket, so my 300W inverter purchase is overkill and I guess I could blow a fuse. Other thing I noticed is the inverter was no good for trying to charge a CyberPower 875AV UPS; the display on the UPS showed it was picking up a little charge but not really. The cheap TrippLite UPS I have did seem to charge better from the inverter, or at least the inverter got awfully warm.

My electric bill says my house averaged using 35,000 Wh a day for the last year. A lot of that is optional load; in a low power situation I wouldn’t be running the pool pump, the electric oven, etc.

A single rooftop solar panel generates roughly 300W maximum, or 1370 Wh per day (average over the course of a year in NorCal).
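Putting the last few numbers side by side gives a feel for the scale. This Python sketch uses only the figures quoted above (35,000 Wh/day for the house, 300W and 1370 Wh/day per panel, 275 Wh of batteries on hand) and ignores inverter and charging losses, so treat the results as upper bounds.

```python
import math

HOUSE_WH_PER_DAY = 35_000   # yearly average from the electric bill
PANEL_PEAK_W = 300          # one rooftop panel, maximum output
PANEL_WH_PER_DAY = 1370     # one panel, yearly average per day
BATTERY_WH = 275            # Powercore plus two spare UPS units

# Fraction of its peak output a panel delivers averaged over the whole day.
capacity_factor = PANEL_WH_PER_DAY / (PANEL_PEAK_W * 24)

# Panels needed to generate the house's average daily usage.
panels_needed = math.ceil(HOUSE_WH_PER_DAY / PANEL_WH_PER_DAY)

# How long the batteries on hand would run the house at its average load.
house_avg_watts = HOUSE_WH_PER_DAY / 24
battery_minutes = BATTERY_WH / house_avg_watts * 60

print(f"capacity factor: {capacity_factor:.0%}")
print(f"panels to cover the average day: {panels_needed}")
print(f"house average load: {house_avg_watts:.0f} W")
print(f"batteries last about {battery_minutes:.0f} minutes at that load")
```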

Archiving and deleting 23andMe

I decided to delete my 23andMe profile. All told it took about a week and I have a copy of everything I care about. Could have gone faster but I was careful and slow about it.

I did the test in 2011 with mixed feelings; genetic data is so super private and yet also so readily available to a determined attacker. I was curious what the reports would say. And I kind of wanted to support the company, I believed their research mission.

The consumer product never quite excited me. Other than one very significant thing (story forthcoming) it’s told me nothing useful. A lot of the stuff they highlight is dubious. I don’t see any reason to continue to pay for access to it.

And then there’s the law enforcement fuckery. Police keep getting more and more aggressive about accessing genetic databases. It’s not clear the courts will protect people. Time to delete the data.

Note that deletion isn’t really deletion. 23andMe seems to make a good faith effort to remove data and destroy samples. But they may have shared some data and it’s no longer entirely under their control. Also your entire genetic data is stored “to meet CLIA requirements”, apparently some regulatory requirement where the government wants labs to keep data so they can evaluate lab quality. In theory all that stuff should be anonymized but I wouldn’t trust that against a determined attacker. If I’d known all this back at the beginning I never would have done the test.

So, what to do when deleting your 23andMe account? Here’s what I did. Note I have very old data in their system, from a genetic test in 2011.

  1. Contact any connections you have and let them know you’re deleting your data. If you shared something with a family member it might be nice to give them a couple of days to take one last look at your data.
  2. Download all user data. This page is accessible from Settings, at the bottom under Data. It has buttons for downloading various sections. The reports data wants to print; I printed that to a PDF file.
  3. Be sure to download raw genetic data. It doesn’t download immediately; it creates a request and you get an email when it’s ready. (Mine took just a few minutes and was 24MB). This is the most important data on the site, the raw genetic test output. It’s all your SNPs. It’s possible to analyze this data with lots of tools, both online and offline.
  4. Download your archived reports. This only applies to pre-2016 accounts, before the FDA settlement, and only contains old science reports.
  5. Browse your current reports. One last trip down memory lane. It turns out I do not have the Neanderthal trait that makes people sneeze after eating dark chocolate. Thanks, 23andMe, that’s definitely worth risking my genetic privacy to know!
  6. Consider downloading / printing any detailed reports you are interested in. They seldom contain any more personal data than is in the summary report but the surrounding context may be useful.
  7. Consider downloading Profile Data. This is a record of inputs you’ve given the site, like address changes and your self-reported phenotype. You have to request a download; it took two days for me. Mine was a few uninteresting tiny CSV files so I’m not sure it was worth the bother.
  8. Request account closure. That’s the final step, the deletion. The big red button is at the bottom of this page. They send you an email to confirm with some caveats.

Once I submitted the final deletion my account looked inaccessible the moment I tried to log in. Well I could log in, but it looked like an empty profile. The site didn’t promise any final notification when the deletion is completed, no idea whether I’ve simply been marked as status: deleted or they actually scrubbed my data. Presumably destroying biological samples takes some time.

As noted above, various copies of your genetic data will be kept for years and there’s nothing you can do to delete it. I believe these are all anonymized and won’t be easy to associate with your name, but that still seems pretty crappy to me.

I will give 23andMe credit, nowhere in this process was there some retention process slowing me down. They never even asked why I was deleting! I appreciated that; it’s much harder to, say, cancel cable TV.

23andMe’s deletion details

This is the disclosure in the email from 23andMe about what deletion means. Note the third bullet item about “will retain your Genetic Information”; despite my request they are not actually deleting my data.

We received your request to permanently delete your 23andMe account and associated data. The following apply when you submit your deletion request:

  • If you chose to consent to 23andMe Research by agreeing to an applicable 23andMe Research consent document, any research involving your data that has already been performed or published prior to our receipt of your request will not be reversed, undone, or withdrawn.
  • Any samples for which you gave consent to be stored (biobanked) will be discarded.
  • 23andMe and the contracted genotyping laboratory will retain your Genetic Information, Date of Birth, and sex as required for compliance with legal obligations, pursuant to the federal Clinical Laboratory Improvement Amendments of 1988 and California laboratory regulations.
  • 23andMe will retain limited information related to your data deletion request, such as your email address and Account Deletion Request Identifier, as necessary to fulfill your request and for the establishment, exercise or defense of legal claims.

Once you confirm your request to delete all of your data, 23andMe will begin processing your request. This decision cannot be cancelled, undone, withdrawn, or reversed.

Better NVidia driver updates

NVidia makes excellent graphics cards with decent drivers.

They also make shitty marketing-ware bloated with crap you don’t want. I’m referring to GeForce Experience, the wrapper around the drivers that runs on Windows. It claims its primary job is updating your drivers. But somehow it needs 100s of megabytes to do that, has a bunch of stupid features in it you don’t want, and requires a login to an NVidia website account to work. Worse, the login requires a CAPTCHA. Right there on my Windows desktop. Stupidest damn thing. Of course the real reason for all this is building ad profiles and marketing and blah blah blah. See also: “NvTelemetry”.

One option for GeForce Experience is simply not to use it. You can still manually download drivers online and install them by hand. For now. But that’s a hassle.

A better option is TinyNvidiaUpdateChecker, a very minimal app for updates. It looks at your machine for the installed version of the drivers, looks online for newer versions, and will install an update for you. It runs super fast, requires no login, and just does its job without getting in the way.

The one drawback is it really is minimal. The default setup is to run once and display its UI in a console window (lol). There’s a guide for making it run at system startup instead, which boils down to “make a shortcut with the --quiet argument and make it run at startup”. It’d be nice if it had an installer and an optional system tray mode.

There’s also an advanced feature called the minimal installer. That breaks apart the NVidia update and only installs some pieces. The goal is to avoid GeForce Experience and NvTelemetry, which is laudable. But the docs make it sound like you might also not get PhysX, HD Audio, etc. That doesn’t sound great to me, but maybe I misunderstand.

Updating nameservers in Ubuntu 19, Pi Hole edition

For some reason, every time I reboot, my Ubuntu 19 box has /etc/resolv.conf configured to point the name server at 127.0.0.1. There’s no nameserver running there, so it fails. I want it set to 192.168.0.1. Where is this coming from? Is it related to having had Pi Hole installed at one time? systemd? What?

tl;dr: there’s like 5 things in an Ubuntu 19 system that might be modifying your name server, and tracking it down is really terrible.

In the old days you edited /etc/resolv.conf and were done. (libc knows to read this file and use it when resolving names. Crazy, huh?) That works transiently but is undone if you reboot. If you look at the file there’s a dire notice

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 127.0.0.1

OK, so systemd is a DNS resolver too now? Awesome. Don’t let the 127.0.0.53 surprise you; that’s basically another name for 127.0.0.1, localhost. Only they’re being sorta clever. But my system is broken; it’s not set to .53, it’s set to .1.

Then you go down the rabbit hole. That command systemd-resolve --status is giving you systemd’s idea of how to resolve names. That in turn seems to be configured by files in /etc/netplan, something you may have created if you configured static networking. Changing those will (presumably) alter the behavior of systemd’s DNS server.

But my problem is my /etc/resolv.conf was being regenerated at boot time to point to 127.0.0.1, not 127.0.0.53. systemd’s resolver is never involved. How to fix that? Second rabbit hole; there’s a system called resolvconf that might be overwriting it with information in /etc/resolvconf. Only what’s on my system isn’t the real (old) resolvconf, it’s actually systemd’s resolvectl running in a resolvconf-compatibility mode. I fiddled with it for awhile and I still don’t understand how this stuff works, but it seems to not be writing this entry.

I finally did a grep and found even though I’d uninstalled Pi Hole, it left behind a /etc/init.d/pihole-FTL script. Which is running resolvconf at boot time to re-set the nameserver to 127.0.0.1. This script shouldn’t still be on my system, and removing it should stop the clobbering. So I removed and rebooted.

Hah, joke’s on me! Every reboot, the 127.0.0.1 entry gets rewritten. Where is it coming from? I found it in /run/resolvconf/interface/enp3s0.dhcp. I’m not using DHCP, I have a static IP address. But that encouraged me to pay attention to /etc/dhcpcd.conf which, yes, is the actual source of the text overriding resolv.conf. I changed it from “static domain_name_servers=127.0.0.1” to 192.168.0.1 and name service works on reboot! Who knows why this matters.. dhcpcd is still running; it appears to be used to configure interfaces with static addresses even if DHCP isn’t involved.
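The search that finally turned up the culprit is just a recursive grep for the bad address. For anyone who would rather script the hunt, here is a minimal Python sketch of the same idea. The function name is made up, and the directories in the usage comment are simply the ones that mattered in this story.

```python
import os


def find_text(root, needle):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip sockets, FIFOs and dangling symlinks; only read regular files.
            if not os.path.isfile(path):
                continue
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable without root, or vanished while walking.
                continue


# Example use (run as root to see everything):
#   for root in ("/etc", "/run/resolvconf"):
#       for path, number, line in find_text(root, "127.0.0.1"):
#           print(f"{path}:{number}: {line}")
```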

But systemd still has one joke left on me. Now when I cat /etc/resolv.conf there are two name servers.

nameserver 192.168.0.1
nameserver 127.0.0.53

I have no idea why systemd decided to inject its .53 in there. It didn’t when the DHCP wrote 127.0.0.1 in there, but change that number to 192.168.0.1 and now systemd’s all in a hurry to put itself there. It may be coming from files I’m afraid to edit in /run. Anyway, unless the rotate option is set the glibc resolver tries name servers in the order listed, so queries should go directly to my name server first and only fall back to systemd’s stub if that one times out.

Fortunately the systemd loop seems to be working. I can’t dig @127.0.0.53 but no ordinary query ever hangs in a way consistent with a broken nameserver. systemd-resolve --status suggests that resolver is configured usefully (forwarding all requests to 192.168.0.1). So I’m just gonna leave it alone since it’s working, even if it’s not what I want.
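A quick way to check after each reboot which name servers actually ended up in the file is to parse it rather than eyeball it. This is a minimal Python sketch; the function names and the three labels are my own shorthand, not anything systemd or glibc define.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they appear in the file."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # resolv.conf comments start with '#' or ';'.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def describe(server):
    """Rough label for where a nameserver entry points."""
    if server == "127.0.0.53":
        return "systemd-resolved stub"
    if server.startswith("127."):
        return "local resolver (something must be listening on this machine)"
    return "external resolver"


sample = """\
# Dynamic resolv.conf(5) file
nameserver 192.168.0.1
nameserver 127.0.0.53
"""
for server in parse_nameservers(sample):
    print(server, "->", describe(server))

# On a real system:
#   with open("/etc/resolv.conf") as handle:
#       print(parse_nameservers(handle.read()))
```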

The problem with systemd isn’t just that it’s a giant beast that tries to do everything. It’s that it’s also magic and poorly documented.

Linux watchdogs in 2019

My home server is dying inexplicably. Totally locks solid, nothing in the logs, very confusing. While I figure out what’s wrong (power supply?) I decided to go implement a watchdog to reboot the system if it fails.

This turns out to be hard in 2019, with Ubuntu 19. tl;dr there are two choices: systemd or the good ol watchdog daemon. See also: softdog.

systemd has watchdog support. Unlike many things in the systemd hydra, this one makes sense to me to integrate. It’s configured in /etc/systemd/system.conf. However it’s not well documented and relatively limited, so I decided not to use it. Even so my system has a PID 77 named [watchdogd] that I think is part of systemd. No idea what it’s doing, if anything.

watchdog is the old school Linux daemon that does watchdoggy things. Ubuntu 19 offers version 5.15-2, and if you install it by default it doesn’t do much. You have to configure it via /etc/watchdog.conf, ensure it loads at startup, and (maybe) install a kernel module to help it.

watchdog works by running a bunch of different tests. It’ll try pinging a network address, check if a file has been modified, see if the load average is too high or if there’s no RAM available. If any test fails it’ll reboot the system. The shutdown is internal, it doesn’t fork any processes.

By default the Ubuntu watchdog.conf has no tests enabled. I think this means it does nothing at all. (It’s possible there’s some “if the system is totally dead then reboot” thing still hiding in there, but if so I don’t see it.) To be useful you want to configure various tests. I have it pinging my router; in theory my computer is still working even if the router is down, but in practice it seems more likely my server’s networking has died. (There’s a systemd upgrade bug.) I’m sure this will end up shooting me in the foot some day.

This is what a watchdog shutdown looks like in the syslog

Oct 13 18:57:31 ub watchdog[30919]: no response from ping (target: 192.168.3.1)
Oct 13 18:58:33 ub watchdog[30919]: message repeated 31 times: [ no response from ping (target: 192.168.3.1)]
Oct 13 18:58:33 ub watchdog[30919]: Retry timed-out at 62 seconds for 192.168.3.1
Oct 13 18:58:33 ub watchdog[30919]: shutting down the system because of error 101 = 'Network is unreachable'

BTW, this is the point to note that in theory if you screw up a watchdog config bad enough, the system might reboot itself so fast you can never get in and fix it without rebooting single-user mode at a console. Fortunately the default config is to reboot after 1-2 minutes of failure, giving you time to get in and fix anything dumb or disable watchdog entirely.
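The log above suggests the ping test tolerates failures for a while and only gives up once the target has been unreachable for more than a minute. Here is a small Python model of that behaviour. It is a sketch of what the log implies, not the daemon's actual code: the 2 second test interval is inferred from 32 messages in 62 seconds, and the 60 second retry timeout is assumed from the "timed-out at 62 seconds" line.

```python
def seconds_until_reboot(ping_results, interval=2.0, retry_timeout=60.0):
    """Model a ping test that reboots after a long enough run of failures.

    ping_results: one boolean per test cycle, True if the ping was answered.
    Returns how long the target had been failing when the model gives up,
    or None if no failure run ever outlasts retry_timeout.
    """
    first_failure = None
    for cycle, answered in enumerate(ping_results):
        now = cycle * interval
        if answered:
            first_failure = None  # any success resets the clock
            continue
        if first_failure is None:
            first_failure = now
        if now - first_failure > retry_timeout:
            return now - first_failure
    return None


# Router unplugged for good: gives up on the first check past 60 seconds.
print(seconds_until_reboot([False] * 40))  # prints 62.0

# A 20 second blip never gets close to the timeout.
print(seconds_until_reboot([False] * 10 + [True] * 30))  # prints None
```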

What happens if the machine is so locked up that the user space watchdog process can’t run at all? What will trigger a reboot then? Enter kernel support for watchdogs, a feature that goes back to 2002. The basic idea is that once some process starts writing to a file named /dev/watchdog, the kernel will reboot the machine if that file is not written to again within about a minute. The kernel’s own watch on itself is implemented at a low level with some sort of CPU timer. Serious systems have extra hardware for this kind of self monitoring, but this method should work reasonably well on a consumer PC unless the kernel itself or the whole CPU locks up.
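The /dev/watchdog protocol is simple enough to sketch in a few lines of Python: keep writing to the device while the system looks healthy, and if you want to stop without triggering a reboot, write the letter V just before closing (the kernel's "magic close", which only works when the driver isn't built with the nowayout option). The class and its parameters are hypothetical, and the device is passed in as a file-like object so the logic can be tried without arming a real watchdog.

```python
import time


class WatchdogFeeder:
    """Keep a kernel watchdog device fed while a health check passes."""

    def __init__(self, device, interval=10.0, sleep=time.sleep):
        self.device = device      # file-like object opened on the watchdog device
        self.interval = interval  # seconds between writes; keep well under the timeout
        self.sleep = sleep

    def feed(self):
        # Any write resets the kernel's countdown.
        self.device.write(b"\0")
        self.device.flush()

    def run(self, healthy, max_feeds=None):
        """Feed until healthy() returns False; return the number of feeds.

        If healthy() fails we simply stop writing and leave the device open,
        so the kernel reboots the machine when its timeout expires.
        """
        feeds = 0
        while healthy():
            self.feed()
            feeds += 1
            if max_feeds is not None and feeds >= max_feeds:
                break
            self.sleep(self.interval)
        return feeds

    def close_cleanly(self):
        # Magic close: 'V' immediately before close disarms the watchdog.
        self.device.write(b"V")
        self.device.flush()
        self.device.close()


# Real use needs root and WILL reboot the machine if feeding stops:
#   device = open("/dev/watchdog", "wb", buffering=0)
#   WatchdogFeeder(device).run(healthy=lambda: True)
```

This is essentially what the watchdog daemon's watchdog-device test does on your behalf, with its other tests playing the role of the health check.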

However if you look on Ubuntu you don’t have a /dev/watchdog file. You have to install it. The simple way is to “modprobe softdog”. Getting this to happen at boot time is remarkably difficult because the module is blacklisted and systemd refuses to load it. The best workaround is to modify /etc/default/watchdog to load “softdog” as a module, fortunately they thought ahead on the need for this. Once you can do that you can enable the test in watchdog.conf.

Putting it all together, here’s how to enable a watchdog in an Ubuntu 19 system

  • apt install watchdog
  • edit /etc/default/watchdog to load the “softdog” module
  • edit /etc/watchdog.conf to enable the tests you want. I enabled ping and watchdog-device.
  • run “systemctl start watchdog” to enable it (or reboot)
  • check the syslog to see that watchdog is logging your active tests and looks reasonable

Update

Sadly the softdog isn’t enough to fix my problem; apparently the computer freezes so solid that the little bit of software needed to reboot doesn’t run. Maybe the CPU locks up entirely, I have no idea.

One escalation would be a USB hardware watchdog like this or this. Looks like it doesn’t require software at all; it plugs right into the hardware reset switch on the motherboard. Presumably it uses USB to power itself and monitor if the computer is working. These are sold for Bitcoin mining rigs, that’s hilarious.

I wish I could find out what was causing the computer to freeze up and replace it. Could be the power supply or motherboard or CPU or RAM.