I have a Linux PC server that flakes out and locks up solid every few weeks. I tried using a software watchdog to fix it but it didn’t help. My guess is the whole CPU is freezing, so even the watchdog can’t run.
I’ve fixed the problem for now with a hardware watchdog: a $10 piece of anonymous Chinese hardware designed for Bitcoin mining rigs.
It looks like a USB device, but the USB is only for power. The main I/O is two pairs of wires: one that connects to your hard drive activity LED, one that connects to your hardware reset switch. Yes, it’s that dumb. Basically it just watches the LED and if it hasn’t flashed in a while (no idea how long, maybe a minute?) it sends a reset to the motherboard.
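Here's a rough sketch of the logic I assume the device implements, simulated in Python. I don't know the real timeout or how the firmware actually works; the 60-second default and the names `LedWatchdog` and `simulate` are mine, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LedWatchdog:
    # Assumed value; the real device's timeout is unknown.
    timeout_s: float = 60.0
    last_activity: float = 0.0

    def led_flashed(self, now: float) -> None:
        """Record that the HDD activity LED was seen lit at time `now`."""
        self.last_activity = now

    def should_reset(self, now: float) -> bool:
        """True once the LED has been dark for at least `timeout_s`."""
        return now - self.last_activity >= self.timeout_s


def simulate(flash_times, end_time, timeout_s=60.0):
    """Step through whole seconds 0..end_time.

    Returns the second at which the watchdog would pulse the reset
    line, or None if the LED kept flashing often enough.
    """
    dog = LedWatchdog(timeout_s=timeout_s)
    flashes = set(flash_times)
    for t in range(end_time + 1):
        if t in flashes:
            dog.led_flashed(t)
        elif dog.should_reset(t):
            return t
    return None
```

A machine whose LED flashes every 30 seconds never trips it, while one that goes quiet after second 10 gets reset at second 70 under the assumed 60-second timeout.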
It seems to work, although I haven’t had a real test yet. Most importantly, it hasn’t caused a false reboot after 36 hours of testing. It did reboot the computer when I ran /sbin/halt though, which is a good sign. I had to stare at the panel of my computer for a while to verify the HDD activity light was actually flashing. Apparently this light works even for modern SSDs plugged in via M.2!
(/sbin/shutdown -h, the command I always use to halt computers, doesn’t actually halt. See this systemd man page for details. “-h” is a synonym for “--poweroff”. You want “-H” or “--halt” to actually halt the CPU without also shutting off the system power. I imagine this is for historical reasons; back in the old days you couldn’t turn the power off with software. Then when that became possible it became the default because it’s almost always what you want.)
There’s a bunch of versions of this hardware watchdog idea; this one is reasonably well documented. Power could also come from an old school Molex connector or using the 9 pin USB motherboard connector, maybe with an adapter. I just routed the wires outside the case and used an external USB port. It’s all spectacularly dumb.
The one hassle with the device I bought is that it doesn’t come with cable splitters; it replaces the existing LED and switch. I bought some ridiculously overpriced splitters. (Turns out this kind of cable and connector is called “Dupont Line” or “Dupont Cable”.) Now my LED isn’t flashing any more; either I screwed up the wiring or there’s not enough power once split.
When I saw the USB port I assumed this device worked by monitoring the USB port itself, testing whether the USB host was still working. But that’s a fairly high level test and you could imagine various scenarios where it wasn’t appropriate. I imagine most computers doing useful work are still tickling the hard drive regularly enough that watching the LED is the better approach; if nothing else, then for logging.
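If a machine turned out to be too quiet on disk, one workaround would be a tiny heartbeat script that forces a write every few seconds. This is only a sketch: I'm assuming an fsync'd write is enough to light the activity LED, which I haven't verified, and the function names and the 10-second interval are made up.

```python
import os
import time


def write_heartbeat(path: str) -> None:
    """Overwrite `path` with the current Unix time and push it to disk."""
    with open(path, "w") as f:
        f.write(f"{time.time():.0f}\n")
        f.flush()
        # fsync asks the kernel to write the data to the device now,
        # rather than leaving it in the page cache.
        os.fsync(f.fileno())


def run(path: str, interval_s: float = 10.0, beats=None) -> int:
    """Write a heartbeat every `interval_s` seconds.

    Runs forever when `beats` is None; otherwise stops after that many
    writes and returns the count.
    """
    count = 0
    while beats is None or count < beats:
        write_heartbeat(path)
        count += 1
        if beats is None or count < beats:
            time.sleep(interval_s)
    return count
```

Something like run("/var/tmp/heartbeat") left running in the background would do it. If the whole machine freezes, the script freezes with it, the LED goes dark, and the watchdog fires.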