Updating nameservers in Ubuntu 19, Pi Hole edition

For some reason, every time I reboot, my Ubuntu 19 box ends up with /etc/resolv.conf pointing the nameserver at 127.0.0.1. There’s no nameserver running there, so name resolution fails. I want it set to 192.168.0.1. Where is this coming from? Is it related to having had Pi Hole installed at one time? systemd? What?

tl;dr: there’s like 5 things in an Ubuntu 19 system that might be modifying your name server, and tracking it down is really terrible.

In the old days you edited /etc/resolv.conf and were done. (libc knows to read this file and use it when resolving names. Crazy, huh?) That still works, but only transiently; it’s undone when you reboot. If you look at the file there’s a dire notice:

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 127.0.0.1

OK, so systemd is a DNS resolver too now? Awesome. Don’t let the 127.0.0.53 surprise you; that’s basically another name for 127.0.0.1, localhost. Only they’re being sorta clever. But my system is broken; it’s not set to .53, it’s set to .1.

Then you go down the rabbit hole. That systemd-resolve --status command gives you systemd’s idea of how to resolve names. That in turn seems to be configured by files in /etc/netplan, something you may have created if you configured static networking. Changing those will (presumably) alter the behavior of systemd’s DNS server.
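
For reference, a static netplan config with its own nameserver looks roughly like this (the file name and the .42 address are made up; enp3s0 is my interface, yours will differ):

# /etc/netplan/01-static.yaml (hypothetical file name)
network:
  version: 2
  renderer: networkd
  ethernets:
    enp3s0:
      addresses: [192.168.0.42/24]
      gateway4: 192.168.0.1
      nameservers:
        addresses: [192.168.0.1]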

But my problem is that my /etc/resolv.conf was being regenerated at boot time to point to 127.0.0.1, not 127.0.0.53; systemd’s resolver is never involved. How do I fix that? Second rabbit hole: there’s a system called resolvconf that might be overwriting it with information in /etc/resolvconf. Only what’s on my system isn’t the real (old) resolvconf, it’s actually systemd’s resolvectl running in a resolvconf-compatibility mode. I fiddled with it for a while and I still don’t understand how this stuff works, but it seems to not be the thing writing this entry.

I finally did a grep and found that even though I’d uninstalled Pi Hole, it had left behind an /etc/init.d/pihole-FTL script, which was running resolvconf at boot time to re-set the nameserver to 127.0.0.1. This script shouldn’t still be on my system, and removing it should stop the clobbering. So I removed it and rebooted.
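
If you’re hunting for the same thing, a grep along these lines is one way to find out who has the bogus address hardcoded (it will match more than just the culprit, but it narrows things down):

grep -rl '127.0.0.1' /etc 2>/dev/null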

Hah, joke’s on me! Every reboot, the 127.0.0.1 entry gets rewritten. Where is it coming from? I found it in /run/resolvconf/interface/enp3s0.dhcp. I’m not using DHCP, I have a static IP address. But that encouraged me to pay attention to /etc/dhcpcd.conf which, yes, is the actual source of the text overriding resolv.conf. I changed it from “static domain_name_servers=127.0.0.1” to 192.168.0.1 and name service works on reboot! Who knows why this matters.. dhcpcd is still running; it appears to be used to configure interfaces with static addresses even if DHCP isn’t involved.
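
For reference, the relevant chunk of /etc/dhcpcd.conf now looks something like this (the .42 address and the other static lines are illustrative; the domain_name_servers line is the one that mattered):

# /etc/dhcpcd.conf
interface enp3s0
static ip_address=192.168.0.42/24
static routers=192.168.0.1
static domain_name_servers=192.168.0.1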

But systemd still has one joke left on me. Now when I cat /etc/resolv.conf there are two name servers:

nameserver 192.168.0.1
nameserver 127.0.0.53

I have no idea why systemd decided to inject its .53 in there. It didn’t when dhcpcd wrote 127.0.0.1 in there, but change that number to 192.168.0.1 and now systemd’s all in a hurry to put itself there. It may be coming from files I’m afraid to edit in /run. Anyway, now half the queries go directly to my name server and half are looped through systemd first.

Fortunately the systemd loop seems to be working. I can’t dig @127.0.0.53 but no ordinary query ever hangs in a way consistent with a broken nameserver. systemd-resolve --status suggests that resolver is configured usefully (forwarding all requests to 192.168.0.1). So I’m just gonna leave it alone since it’s working, even if it’s not what I want.

The problem with systemd isn’t just that it’s a giant beast that tries to do everything. It’s that it’s also magic and poorly documented.

Linux watchdogs in 2019

My home server is dying inexplicably. It totally locks up solid, nothing in the logs, very confusing. While I figure out what’s wrong (power supply?) I decided to implement a watchdog to reboot the system when it fails.

This turns out to be hard in 2019, with Ubuntu 19. tl;dr: there are two choices, systemd or the good ol’ watchdog daemon. See also: softdog.

systemd has watchdog support. Unlike many things in the systemd hydra, this one makes sense to me to integrate there. It’s configured in /etc/systemd/system.conf. However it’s not well documented and it’s relatively limited, so I decided not to use it. Even so, my system has a PID 77 named [watchdogd] that I think is part of systemd. No idea what it’s doing, if anything.
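
For completeness, the systemd version amounts to a couple of lines in /etc/systemd/system.conf (the values here are made up, and it still needs a /dev/watchdog device to talk to, see below):

[Manager]
RuntimeWatchdogSec=60
ShutdownWatchdogSec=10min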

watchdog is the old-school Linux daemon that does watchdoggy things. Ubuntu 19 offers version 5.15-2, and if you install it, by default it doesn’t do much. You have to configure it via /etc/watchdog.conf, ensure it loads at startup, and (maybe) install a kernel module to help it.

watchdog works by running a bunch of different tests. It’ll try pinging a network address, check whether a file has been modified recently, see if the load average is too high or if there’s no RAM available. If any test fails, it’ll reboot the system. The shutdown is internal; it doesn’t fork any processes.

By default the Ubuntu watchdog.conf has no tests enabled. I think this means it does nothing at all. (It’s possible there’s some “if the system is totally dead then reboot” thing still hiding in there, but if so I don’t see it.) To be useful you want to enable various tests. I have it pinging my router; in theory my computer could still be working fine even when the router is down, but in practice it seems more likely my server’s networking has died. (There’s a systemd upgrade bug.) I’m sure this will end up shooting me in the foot some day.
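
Concretely, the lines I enabled in /etc/watchdog.conf amount to something like this (192.168.3.1 is my router; the watchdog-device line only does anything once softdog is loaded, see below):

ping            = 192.168.3.1
watchdog-device = /dev/watchdog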

This is what a watchdog shutdown looks like in the syslog:

Oct 13 18:57:31 ub watchdog[30919]: no response from ping (target: 192.168.3.1)
Oct 13 18:58:33 ub watchdog[30919]: message repeated 31 times: [ no response from ping (target: 192.168.3.1)]
Oct 13 18:58:33 ub watchdog[30919]: Retry timed-out at 62 seconds for 192.168.3.1
Oct 13 18:58:33 ub watchdog[30919]: shutting down the system because of error 101 = 'Network is unreachable'

BTW, this is the point to note that in theory, if you screw up a watchdog config badly enough, the system might reboot itself so fast you can never get in to fix it without booting single-user at a console. Fortunately the default config is to reboot after 1-2 minutes of failure, giving you time to get in and fix anything dumb or disable watchdog entirely.

What happens if the machine is so locked up that the user-space watchdog process can’t run at all? What will trigger a reboot then? Enter kernel support for watchdogs, a feature that goes back to 2002. The basic idea: once some process writes to a file named /dev/watchdog, the kernel expects that file to keep being written to about once a minute; if the writes stop, the kernel reboots the machine itself. The kernel’s own watch on itself is implemented at a low level with some sort of CPU timer. Serious systems have extra hardware for this kind of self-monitoring, but this method should work reasonably well on a consumer PC unless the kernel itself or the whole CPU locks up.
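
A minimal sketch of what a feeder looks like, just to show the shape of the interface. Don’t run this casually: writing to /dev/watchdog arms the timer, and I believe with softdog you have to write a ‘V’ before closing the device to disarm it, so stopping the loop gets you a reboot about a minute later.

# needs root; pets the watchdog every 30 seconds, well inside the ~60 second timeout
( while true; do echo . ; sleep 30 ; done ) > /dev/watchdog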

However if you look on Ubuntu you don’t have a /dev/watchdog file. You have to install it. The simple way is to “modprobe softdog”. Getting this to happen at boot time is remarkably difficult because the module is blacklisted and systemd refuses to load it. The best workaround is to modify /etc/default/watchdog to load “softdog” as a module; fortunately they thought ahead on the need for this. Once that’s done you can enable the test in watchdog.conf.
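
The edit to /etc/default/watchdog is a one-liner; there’s a watchdog_module setting already sitting there waiting to be filled in (if I remember the defaults right, the rest of the file can stay as-is):

# /etc/default/watchdog
run_watchdog=1
watchdog_module="softdog"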

Putting it all together, here’s how to enable a watchdog on an Ubuntu 19 system (the steps are sketched out as commands after the list):

  • apt install watchdog
  • edit /etc/default/watchdog to load the “softdog” module
  • edit /etc/watchdog.conf to enable the tests you want. I enabled ping and watchdog-device.
  • run “systemctl start watchdog” to enable it (or reboot)
  • check the syslog to see that watchdog is logging your active tests and looks reasonable
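
Sketched out as commands (service and file names as above; adjust the config edits to taste):

sudo apt install watchdog
sudoedit /etc/default/watchdog   # set watchdog_module="softdog"
sudoedit /etc/watchdog.conf      # enable ping = <your router> and watchdog-device = /dev/watchdog
sudo systemctl restart watchdog
journalctl -u watchdog -f        # confirm it logs the tests you enabled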

Epic Launcher: running copied games

The Epic Launcher on Windows stores its games in C:\Program Files\Epic Games. However if you copy a game from one computer to another, the Epic Launcher on the new computer won’t recognize it or let you play it. Super annoying if you want to avoid re-downloading 60GB of stuff you already have a copy of. (Steam does not have this nuisance.)

The workaround is to first start installing the game on the new computer using the Epic Launcher. Let it download for a few seconds, then abort the install (don’t just pause it) from the launcher. Then go into the folder and delete the game directory on the new computer. Then copy the files from the old computer to the new one. Finally, in the Epic Launcher click “Resume” and it will verify the files rather than re-download them, and let you play the game.

This solution is described in How to Move Fortnite to Another Folder, Drive, or PC. The problem is also discussed in EpicGamesLauncher – Recognising already installed files which includes some hints about a C:\ProgramData directory that may contain the hidden state that’s missing if you just copy the program files. I didn’t pursue that avenue for making things work but it’s a good idea.

Upgrading systemd kills Ubuntu networking

Two or three times this year I’ve had a bug in a relatively fresh Ubuntu install where upgrading systemd kills the networking. I’ve got no idea what’s wrong and no physical console on the machine to inspect it. It feels like something simple like the upgrade script shut down networking but then didn’t bring it back again, but of course it could be anything.

So far this is happening only on my 18.10/19.04 box in San Francisco. Fortunately I have physical access so at least I can reboot it (usually). I haven’t seen it happen on my 18.04 box in a datacenter.

Ubuntu has a couple of recent bugs about this. Bug 1803391 boils down to a one time bug in an upgrade script in systemd. Bug 1782709 is open and confirmed with many reports and no clear idea what might be wrong. It looks like the kind of bug report that stays open for 4 years until someone closes it as irrelevant :-(

I do not want to have to learn that much about the guts of systemd and dpkg just to diagnose this. systemd is particularly gnarly to debug since it does so much. Among other things it’s also capturing system logs, so there’s a risk that whatever gets logged about the error disappears during the upgrade. Or not, who knows?

Again, I’ve made my peace with systemd, but this bug is a great example of the danger of centralizing so many functions into one component.

Edit: I’ve placed a hold on systemd upgrades for now. Never upgrading systemd is not a solution, but this will serve as a reminder to only do the upgrade when I’m able to get at the physical machine.
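
For reference, the hold itself is just this (apt-mark is one way to do it; dpkg holds work too):

sudo apt-mark hold systemd
# later, standing next to the machine:
sudo apt-mark unhold systemd && sudo apt upgrade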

systemd and PolicyKit1

I’ve mostly made my peace with systemd but its usability sure leaves a lot to be desired. My fun interaction for today, trying to start a small service:

$ systemctl start hover-dns-updater.service
Failed to start hover-dns-updater.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
See system logs and 'systemctl status hover-dns-updater.service' for details.

A surprising error, particularly since I’m not even using any sort of desktop Linux. And no, the suggested systemctl status command doesn’t show any information at all about why the service failed to start; the log is empty.

The simple fix was to re-run systemctl as root, with sudo. Everything worked! That’s probably what one should do in reality anyway. I’m not up to date on Ubuntu’s current status of least privileges and user accounts, but in general “starting system services requires root” doesn’t surprise me.

However, there is another way: install the policykit-1 package with apt. That allows that systemctl command to work without being root. It still requires you to type your user password to authenticate yourself (again), so it’s not clear when it’s really useful.
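
Concretely, the two options (service name from above):

# option 1: just be root
sudo systemctl start hover-dns-updater.service

# option 2: install polkit, then a regular user can do it (after typing their password)
sudo apt install policykit-1
systemctl start hover-dns-updater.service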

I’ve never heard of policykit. This is what the README says:

polkit is a toolkit for defining and handling authorizations. It is used for allowing unprivileged processes to speak to privileged processes.

802.11 bad signal diagnostics (Ubiquiti)

Ken’s computer was having a bad time connecting to our wifi network via 802.11n. I finally figured out it was mostly a problem with signal strength in his room; a second computer displayed the same symptoms with the same 802.11n adapter. We solved the problem by upgrading him to an 802.11ac adapter. Along the way I learned some things about WiFi signal reliability. Usual caveats apply: I’m just learning this stuff myself and am an amateur.

802.11 link layer reliability

WiFi’s link layer has its own set of complex reliability stuff built in, sort of analogous to TCP but solving a different problem. That makes WiFi quite different from ethernet, which AFAIK doesn’t have any link layer stuff for handling a lost packet other than the basic collision detection logic (nearly irrelevant in modern switched networks). There’s also some rate negotiation in modern ethernet; if 1000Base-T isn’t working reliably it’ll step down to 100Base-T. But that’s about it. I think ethernet makes the assumption its link layer is basically reliable and figures it’s the upper layers’ problem (ie TCP) to deal with lost packets.

WiFi is an inherently flaky medium though. It’s expected that not all packets will get through the radio æther unmolested by gremlins. So the 802.11 link layer has a couple of mechanisms built into it for reliability.

One tool is forward error correction. The codewords broadcast on 802.11 aren’t just the bits; they’re error correcting codes capable of fixing single bit errors at varying rates. I believe in practice wifi sends anywhere from 2 bits for every 1 bit of user data to 6 bits for every 5.

A second tool is link-layer retries, something a little like TCP’s retry-and-ACK protocol. This is called ARQ (Automatic Repeat Request) in what I’ve read, and basically amounts to tracking sequence numbers and resending packets that aren’t acknowledged. I don’t know how many retries are attempted or whether retries stall the link; this article has details that suggest it’s 4 or 7 retries. But note all this retrying happens on one local wireless link, in under a millisecond. Very different from TCP waiting a full round trip for a retransmit.

A final important tool is rate negotiation. Much like TCP uses lost packets and ACK clocking to figure out the maximum speed, WiFi gear monitors error rates and, if it sees too many, steps down to a less aggressive data rate. There’s a bunch of choices here, summed up by the “MCS index”; you might strengthen the error coding from 3-in-4 to 1-in-2, or use 16-QAM instead of 64-QAM, or whatever. I don’t know much about how this works in WiFi but I imagine it’s quite subtle and different from TCP’s algorithms. This paper (which I haven’t read) has a description of some of the algorithms in practice.

I enjoyed reading this paper about 802.11 reliability. A couple of tidbits I picked up: up to half the theoretical bandwidth on an 802.11 connection is consumed by reliability mechanisms, particularly ACKs. Also there’s a paradox that faster links seem to be more reliable. There are a lot of potential reasons for that, from the tautology that the link is faster because it is more reliable, to odd things like faster transmits meaning less time spent flying through the æther exposed to gremlins.

So that’s the theory. WiFi is designed with a lot of tools to deal with inherently unreliable radio communication. Error correction, retries, rate negotiation, it’s doing a lot.

802.11 diagnostic tools

Despite all that magic, Ken was seeing speeds of like 500 kbps on his wireless link, truly awful. The most useful tool I found for measuring this on Ken’s client was simply the DSL Reports speed test. That’s testing throughput all the way to an Internet site, so it’s not really a very good test of a local wifi link. But I verified once or twice that he was seeing the same slow speed internally, and he was getting a link much slower than our Internet connection. The nice thing about the consumer speed test tool is it shows you momentary bandwidth, so you can get a feel for jitter, packet loss, etc. There are many better specialist tools for testing link throughput on your local WLAN; I took the one easiest to put my hands on.
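
If you want a proper local measurement, iperf3 is the standard sort of specialist tool I mean (I didn’t use it here, and the 192.168.3.10 address is made up; use whatever wired box you have):

# on a wired machine on the LAN
iperf3 -s

# on the wifi client, pointed at the wired machine
iperf3 -c 192.168.3.10 -t 30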

On the server side, I have a fancy UniFi network with a complex bunch of network control software I never use. Turns out it has useful reports though! See Identifying Wi-Fi Issues with Debugging Metrics.

Well that’s your problem right there. According to my UniFi access point, Ken’s computer is receiving at a rate of 1 Mbps. That’d explain the 528 Kbps throughput. Yuck! I never could figure out why it was that low, though. Or why it never seemed to increase again; the adapter went to the lowest common denominator and got stuck there.

Another useful set of graphs buried in UniFi’s tool are time series charts of the AP’s performance. Graphs like this:

Note that’s a graph from when things were mostly going OK; I don’t have a picture of the problem. But at the time it was not working well, the 2.4GHz band was showing 80% channel utilization, all for Ken’s crappy 1 Mbps stream. And also lots of dropped and retried packets. Bad news.

I did verify that the same 802.11n adapter also sucked on a second computer in the same room, so it seemed to be the adapter, not the computer. But 802.11ac was fine. So the solution was simple: spend $25 on a decent wifi adapter and be done with it.

802.11ac, 802.11ax, and friends

A few years back I wrote up a blog post about 802.11n. The learning that went into that served me well. 802.11n is a fairly complex technology and also a very good one; the improvement over 802.11g was enormous. Well now it’s been a few years and 802.11n is old news, 802.11ac is standard and 802.11ax is the new hotness. What are they? (Same caveats as always apply; this is a gentleman amateur’s understanding, mostly based on reading Wikipedia articles.)

One observation: everyone talks about max throughput as the big performance number. But what people really hate about WiFi is when it’s unreliable. I’d much rather have a reliable 100 Mbps connection than a flaky 1 Gbps connection. A related issue is the number of retransmits the wireless medium has to do and how that percolates up to dropped packets and unreasonable jitter at the IP level. Right now wireless connections are way, way worse for latency-sensitive applications like videoconferencing or gaming. Fixing that is not the focus of the stuff I’ve read about wifi improvements, but it may still be a happy side effect.

802.11ac

802.11ac is the evolution of 802.11n, a roughly-2013 technology. Sometimes it’s marketed as “WiFi 5”, and it comes in two major flavors: Wave 1 and Wave 2. To a first approximation 802.11ac is technically just like 802.11n but simply faster. More bits per symbol, wider channels, etc. These add up to faster networks with more robust fallbacks, but no major new concepts.

The big change in practice is that 802.11ac connections run at 5GHz (the 2.4GHz band on “AC” routers is still served by 802.11n). 802.11n can in theory run at 5GHz, but a lot of the equipment only supported it at 2.4GHz. The higher frequency means the signals don’t penetrate walls as well, but it also means more bandwidth. Even better, the 5GHz band has a very generous allotment of channels and a bunch of smart stuff to avoid interference with neighbors, so it works better in dense urban areas.

Nominally an 802.11ac channel is 20MHz wide, but 802.11ac can use up to 160MHz, effectively combining 8 channels for 8x the bandwidth (802.11n was limited to 40MHz). The other big improvements are more MIMO streams (8 vs 4 in 802.11n) and denser modulation (256-QAM vs 64-QAM in 802.11n). All of those combine for yet higher effective throughput. In reality most equipment doesn’t support the very highest throughput modes. That’s where the Wave 1 and Wave 2 marketing language comes in; Wave 1 tops out at 1.3 Gbps, Wave 2 at 2.34 Gbps. In practice you’re going to get more like 200-800 Mbps if you’re doing well.

Which brings us to more marketing language. WiFi adapters are sold with names like “AC 1300”. That doesn’t mean 1300 Mbps. It means the adapter will run up to 400 Mbps at 2.4GHz and 867 Mbps at 5GHz. Even though you only get one or the other, the marketing people add the two numbers together for the product name, and apparently round up: 400+867=1267, marketed as 1300. Dumb, but at least it’s clear enough what it means. Your AC 1300 device will never go faster than 867 Mbps. And bigger numbers are better.

Most of 802.11ac’s changes over 802.11n are “go faster”, but there are a few new concepts too. One is Multi-User MIMO, which somehow lets a router support more aggregate bandwidth for a cloud of devices than the maximum throughput for any one device. I think this is spatially based, so that bandwidth to, say, the north of the router is independent of bandwidth to the south.

The other change is beamforming. This lets the equipment do something like a directional antenna, directing signal in the direction it’s trying to communicate. Only you don’t have to move the antenna, instead the electrical elves inside the antenna form the beam by holding up little mirrors. (I may have some of the details wrong but I’m sure my explanation is more comprehensible than the correct one.) I have no idea how well it works in practice.

So that’s 802.11ac. “Faster 802.11n that prefers 5GHz” is a reasonable summary. At this point any new hardware you buy should support 802.11ac. And it may be worth upgrading any old 802.11n gear you have; switching to 5GHz is a nice thing.

802.11ax

802.11ax is the 2019 hotness, also branded WiFi 6. I don’t really know much about it, but to a first approximation it’s yet another improvement beyond 802.11ac. A lot of the changes seem to be about avoiding congestion in denser areas. Smarter dynamic power management, better MIMO options, and the ability to use a wider range of frequencies between 1 and 7 GHz.

The addition of Orthogonal frequency-division multiple access (OFDMA) sounds like a big deal. This is a new way to share spectrum among multiple users which should help contention. I think it’s all coordinated by a single access point so I guess it works for one WLAN, not resolving contention between neighboring WLANs.

Another change is longer guard intervals. That sounds like it’d lower bandwidth, but it benefits longer-distance operation. There are also changes in subcarrier spacing and symbol durations that I don’t understand.

It’s not clear what the maximum throughput of 802.11ax will be. People are throwing around the number 10 Gbps, which is phenomenal. According to the Wikipedia chart a single stream tops out at about 1200 Mbps, but you can have multiple streams. It’ll be interesting to see what people get in practice.

802.11ad and 802.11ay

I’ve never encountered these, but given the cutesy naming (one letter past ac/ax) I include them here. These are 60GHz protocols, also called WiGig, and I’m curious whether they’re an evolution of WiMAX. Maybe not; the drawback of 60GHz is not only does it not penetrate walls, you can also only throw it about 1km through the air. So it’s not great for a fixed wireless network. But so much bandwidth! 802.11ad can do 7Gbps, and 802.11ay is talking about 176Gbps. Wikipedia talks about this being interesting for wireless displays.