Mastodon account migration (Apr 2022)

My old Mastodon instance stopped working. I want to migrate to nelson@tech.lgbt. Unfortunately migration is pretty limited in Mastodon. There is no simple way to move your actual content, your toots. You can move your list of who you follow though, and in some circumstances get your followers to follow your new account. There are three ways to migrate:

  • Just redirect the old account. This doesn’t actually move any data or set up the new account at all, but it does tell any new visitors you’ve moved.
  • Use the “account migration” tool. This basically does a redirect, but then also “irreversibly force everyone to unfollow your current account and follow your new account”.
  • Migrate manually with data exports and imports.

The first two methods require the server you’re moving from be online and federated. My old server is dead. But I do have a data export archive so I can use that to set up the new account.

What’s in a Mastodon export? It’s a bit messy, there are several files.

  • Who you follow, block, mute. Also lists and bookmarks. These are in separate CSV files you have to download individually. This is the stuff you can usefully import elsewhere.
  • All your toots in an archive, along with a bunch of images. The toots are in a big JSON blob. There’s also some other data like your avatar image. These cannot be imported.
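For the record, the follow list is just a plain CSV file. A sketch of its shape, with invented account names (the exact columns vary by Mastodon version, so don't take the header as gospel):

```
Account address,Show boosts
someone@mastodon.social,true
somebody.else@tech.lgbt,true
```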

Bottom line, the only thing I can really import usefully is who I follow. And doing that won’t get them to follow me. Mastodon’s blog post says “moving instances is painless and straightforward with Mastodon” but that’s really not true. I’m not surprised; doing this in a federated system with user security as a top priority is hard.

The import seemed to go smoothly. It’s not instant; the import page says “we’ll get to it eventually”. It seemed to take a few minutes.

The import was not complete, but maybe it takes a while. My export had 275 accounts I was following. The first time I looked, after 15 minutes, I had about 114 people I was following (according to the new server’s export tool). The second time, after 45 minutes, it was 177. I don’t expect all 275 of the accounts will ever import; some are from servers that are gone.

I’m starting to get followers again, I suspect because they saw the “nelson followed you” notification. It will be impossible to get most of them over without the old server coming back up.

The nicest thing about my new instance is the skin is now a clean, single-column layout of toots. I really hated that old darkmode multicolumn thing!

OpenWRT (RPi4): giving up

I’m giving up on OpenWRT on my RPi4. I’ve spent the last few days getting it working and while it’s mostly good, I don’t trust it. I’m about to go out of town for a while and I can’t leave my partner with a possibly flaky router, so I’m back to the Ubiquiti router. It’s buggy but I understand what the bugs are and can live with them. OpenWRT is very impressive and would probably work for me, but it requires more tinkering than I want to do. It does not spark joy in me.

The main problem with my OpenWRT setup is the mwan3 failover isn’t working for me. I got woken up by my alarm system loudly beeping, the stupid thing it does when there’s no Internet for a few hours. Starlink had gone down again. That’s not OpenWRT’s fault, Dishy itself was out. But the mwan3 failover in OpenWRT didn’t work like it should have.

I could probably diagnose or fix this if I wanted to spend the time. The permission denied error in my logs is probably a good place to start. But I just don’t want to bother. (FWIW failover works if I disable one interface entirely. But that may just be Linux routing working, it has both WAN interfaces as routes. mwan3 is pinging interfaces and logging errors but I’ve never seen it actually mark one bad and switch.)

More generally there’s a bunch of rough edges in OpenWRT. The lack of persistent USB device naming is awkward, and potentially a source of a hard-to-diagnose problem in the future. The way package upgrades work (or don’t work) makes me nervous. I don’t trust that a full system upgrade will work when the time comes. I don’t like that I had to spend 5-10 hours reading docs and tinkering to get the things I want set up. All in all I want a router to just be an appliance, something I don’t spend time thinking about. OpenWRT is more of a firmware for hackers who want to be thinking about their routers.

I’ve said this before, I’d love something with the ease of use of the old Tomato firmware that’s based on OpenWRT. OpenWRT has improved a lot since 2015, LuCI is a very nice UI. Back then I got a comment that Gargoyle might fit the bill; I see that project is still getting updated, so maybe it’s worth a try. Also maybe my expectations are unreasonable. There’s lots of consumer routers out there that do a basic home LAN just fine, Starlink even gives you one. But I want more; I want multi-wan failover, I want VPN tunneling, I want detailed diagnostic data. It’s asking a lot.

Still I’m glad I spent the time with OpenWRT. I’d definitely reach for it again if I had specialist needs or was in a mindset to be more patient with the tinkering required. It’d also make a fantastic foundation for derived products. I think several commercial routers are sold with some variant of OpenWRT pre-loaded. Rumor has it Starlink’s own router is based on OpenWRT.

OpenWRT + Wireguard (RPi4)

My new, unrewarding hobby of router tinkering continues. This time setting up a Wireguard tunnel between my two houses. I want my two LANs 192.168.0.0/24 and 192.168.3.0/24 to talk to each other transparently via a VPN tunnel. I’d gotten this sorta working between Linux boxes inside the two LANs before but the routing never quite worked. Running a VPN tunnel directly on the router is one reason I’m now trying to set up this RPi4 with OpenWRT.

Long story short, it was very easy to get my OpenWRT router connecting as a client to a working Wireguard server on my other LAN’s Linux box. I can’t fully test this because I don’t have Wireguard set up on the other side’s router yet. But the tunnel seems to be working at least somewhat.

Note to self: my Ubiquiti router at the other house is set up to not answer pings. Ping is not a good test! SSH is, though.

Once again the OpenWRT docs for Wireguard are very good. The basics tell you how to install it including a LuCI interface and this client guide has some extra good info.

It boils down to the same way you set up Wireguard anywhere. Generate a key pair, configure the server to accept the new OpenWRT public key, configure the OpenWRT client with the server’s endpoint and public key, and you’ve got the tunnel up. It helps to log into the OpenWRT command line and run “wg show” and the like manually, but LuCI actually has a full UI for configuring and monitoring. Note that when you add the interface there’s a Wireguard-specific “Peers” tab you need to configure.

The one big gotcha is the “No Host Routes” option (aka nohostroute) as discussed here; it’s in Interfaces / General Settings. Without it, OpenWRT will add a host route for the Wireguard endpoint on one specific interface. This interacts poorly with a multi-WAN setup like mwan3. Setting nohostroute tells OpenWRT to not set an explicit route. There’s some concern that if you then configure all traffic on the router to go through the VPN link then Wireguard itself will try to also talk to the Wireguard server via the Wireguard interface, causing some sort of loop or recursion. Not a problem in my case.

In the Peers settings I went ahead and explicitly included Allowed IPs of 192.168.7.0/24 (my VPN endpoint subnet) and 192.168.0.0/24 (the LAN at the other end of the VPN). I’m not sure that’s necessary.

I’m confused about what to put in the Firewall Settings for the wireguard interface. This 2018 guide suggests creating a new zone and configuring rules for it. But then it says “This step is probably optional (you could just add the interface to the lan zone)” and I think that’s probably correct; in general I want to treat the other side of this VPN tunnel as part of my LAN. I definitely don’t want masquerading/NAT for it! I can’t really test this properly until I have a router on the other side set up doing Wireguard. I don’t understand modern Linux firewall rules very well.

All of this is complex and at the edge of my patience for dealing with stuff. OpenWRT and Wireguard are very powerful and flexible tools. More than once I found myself wishing there was some hand-holding simplification for me, a “just press this button to make it work” thing. But then I don’t know how common goals like mine are.

OpenWRT + RPi4: failover and load balancing

I have two ISPs; I want to use Starlink all the time unless it goes down, in which case I want to fail over to my fixed wireless. OpenWRT supports this with the MWAN3 package, along with load balancing and various forms of traffic-based routing. It’s flexible and a little complicated. The whole reason I am ditching my Ubiquiti router is the long-standing bugs in its failover implementation; I’m hoping OpenWRT’s is more reliable.

The MWAN3 docs are excellent but do assume some networking knowledge. Long story short, it’s a set of scripts that implements load balancing and failover by manipulating Linux policy routing. Basically it pings the interfaces and adjusts the routing table appropriately if a connection goes down.

See also my previous post on OpenWRT + RPi4 with a single WAN.

Hardware and device names

I’m using a Raspberry Pi 4 as my router. I’ve purchased two TP-Link UE300 USB ethernet adapters for my WAN ports. They are plugged in to the two blue USB-3 ports. This means I’m going to have two actual hardware ethernet devices for the WAN, eth1 and eth2. (eth0 is the LAN which is the RPi’s on-board ethernet.) Some of the MWAN docs talk about VLANs and switches and devices named like eth1.0; no VLAN needed with the USB hardware.

Unfortunately as far as I can tell OpenWRT has no built-in provision for naming devices consistently. When I was plugging and unplugging things I would see one of my adapters go from being “eth1” to being “eth2” despite it always being plugged into the same port and the MAC address not changing. This is not good, it makes identifying which interface you want to use for what difficult. However I think as long as you don’t hotplug anything the device names are assigned consistently in order. If you look in dmesg there’s messages like these:

usb 2-1: New USB device found, idVendor=2357, idProduct=0601, bcdDevice=30.00
r8152 2-1:1.0 eth1: v1.10.11

USB 2-1 seems to be the name for the upper blue USB port. With two dongles plugged in, USB 2-1 seems to always get named eth1 and USB 2-2 always seems to get named eth2. All bets are off if you unplug and replug things afterwards though.

A better solution would be to name devices by the MAC address or some other (relatively) immutable property of the hardware. I found some notes on doing this online here, here, and here. That last one refers to the OpenWRT “Community Build” for RPi4 but that project looks a bit messy and there’s some drama going on with it so I haven’t tried it.

Anyway, after 5 reboots the device names have been stable so I’m going to stick with this. For my notes:
Starlink: usb 2-1, eth1, WAN1, 5C:A6:E6:AA:BC:9A
SmarterBroadband: usb 2-2, eth2, WAN2, 54:AF:97:5B:F9:63

One last note on naming: OpenWRT has a concept of “interface name” in Networking / Interfaces, a step of indirection from device names. I.e., I have eth1 named “WAN1” in OpenWRT and a lot of other OpenWRT software uses WAN1 as the name. I ran into a problem because I named my interfaces WAN1/WAN2. The mwan3 scripts default to assuming these names are “wan” and “wanb” and I regret not using those names for simplicity.

Setting up the second ISP

Before you do anything with mwan3 you have to get the second ISP working in OpenWRT. This is as simple as adding it in Network / Interfaces. The “gateway metric” advanced setting is important; Linux will send packets to the gateway with the lowest metric. If two devices have the same metric (which is the default if you don’t edit it) I’m not sure what happens; without mwan3 or some other load balancing configuration it looks like only one of them gets used.
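In config terms that’s just two interface sections with distinct metrics. A sketch using my device names (option names as of OpenWRT 21.02; older releases use option ifname instead of device, and the metric values are arbitrary as long as they differ):

```
# /etc/config/network (excerpt)
config interface 'WAN1'
    option device 'eth1'     # Starlink, preferred
    option proto 'dhcp'
    option metric '10'

config interface 'WAN2'
    option device 'eth2'     # SmarterBroadband, backup
    option proto 'dhcp'
    option metric '20'
```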

If you’re using luci-app-statistics this is a good time to go to Statistics / Setup and make sure you have graphs turned on for all of your network interfaces.

Installing and using MWAN3

This is as simple as opkg install mwan3 luci-app-mwan3 (or the LuCI equivalent). However the moment you install mwan3, it starts running and doing routing. The default config is reasonable, it tries to load balance on “wan” and “wanb”. But because I used “wan1” and “wan2” instead as my names nothing worked. Ironically I hadn’t even installed the LuCI interface yet, so I had to edit the config files by hand.

The config file is in /etc/config/mwan3. There is also a “mwan3” command line tool that is very helpful when tinkering.

The LuCI interface is in two places. Network / Load Balancing is where the config is done. Status / Load Balancing is the status page. (Some of the docs refer to an obsolete location for the status page.)

Configuring MWAN3

The docs are great so I won’t try to reproduce them. In summary though: you describe interfaces, one per ISP connection. Then you define members, which are statements like “wan_m1_w3 uses interface WAN1 with metric 1 and weight 3”. Members are combined into policies; the member that’s currently up and valid and has the lowest metric is the one the policy will use. If there are two members with the same metric, it will load balance according to the weights. Example policies are “balanced” (use both wan and wanb) and “wan_wanb” (failover; use wanb only if wan is down). Finally you define a set of rules which say “traffic of this type gets sent out this policy”.
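As a concrete sketch of those layers, here’s roughly what the member/policy/rule structure looks like in /etc/config/mwan3 (this mirrors mwan3’s default naming; treat exact values as illustrative):

```
# /etc/config/mwan3 (excerpt)
config member 'wan_m1_w3'
    option interface 'wan'
    option metric '1'
    option weight '3'

config member 'wanb_m2_w2'
    option interface 'wanb'
    option metric '2'
    option weight '2'

config policy 'wan_wanb'
    list use_member 'wan_m1_w3'
    list use_member 'wanb_m2_w2'

config rule 'default_rule_v4'
    option dest_ip '0.0.0.0/0'
    option family 'ipv4'
    option proto 'all'
    option use_policy 'wan_wanb'
```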

It’s a complicated and flexible system. The default config is to load balance the two connections with slightly unequal weights. I just edited the rules to use the “wan_wanb” policy instead of “balanced” to get strict failover behavior. (I also edited to accommodate my different naming convention but only for interfaces; I left mwan’s “wan” and “wanb” naming for members, policies, etc.)

There’s a lot of subtle failover behavior you can configure, particularly in the interfaces section. That’s where you set things like ping intervals or how many pings have to fail before the interface is considered dead. I’ve left all the defaults in place other than changing the IP addresses it pings. The package uses the docs’ defaults except reliability is set to 2. I think that means it tests every 10 seconds. A test pings each of 4 hosts once with a 4 second timeout; if 2 of the pings succeed the test passes. If 5 tests fail in a row the link is considered dead; 5 successes and it’s alive again. There’s a complicated extra set of heuristics for “check_quality” but I believe those are disabled by default.
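Those knobs live in the interface sections of /etc/config/mwan3. My reading of the defaults above corresponds to roughly this (option names per the mwan3 docs; the track_ip hosts are whatever you choose to ping):

```
# /etc/config/mwan3 (excerpt)
config interface 'wan'
    option enabled '1'
    option family 'ipv4'
    list track_ip '8.8.8.8'
    list track_ip '1.1.1.1'
    option reliability '2'   # pings per test that must succeed
    option count '1'         # pings sent to each track_ip per test
    option timeout '4'       # seconds to wait for each ping
    option interval '10'     # seconds between tests
    option down '5'          # consecutive failed tests = link dead
    option up '5'            # consecutive passed tests = link alive
```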

There’s also a global option to enable syslogging. Turning that on to “notice” gives a reasonable, non-spammy syslog of failed pings and failover.

Does it work?

Yes! So far so good. It all seems a bit complicated for basic failover but once I got going the configuration was really pretty simple.

One gotcha when testing; if you search Google for “what is my ip” it will helpfully show you your IP address. But this might be a little out of date; in some cases you can still have a browser connection open on the backup WAN interface even though new connections are being done on the primary WAN. I think curl http://api.ipify.org/ is a more reliable way to verify what your router is choosing at the moment.

Observed behavior on the testing is a bit more complicated than I thought. The syslog messages from mwan3track talk about a “score” that’s getting incremented faster than the 10 second interval I thought it had.

Basic failover seems to be functioning. I’ll know more in a day or two, I’m curious how well the default heuristic is going to work with Starlink’s “drop out for a few seconds” failure mode as well as the nightly congestion. The worst thing is if the router flaps over to failover mode too often, sometimes it’s better to wait out a transient failure. Will need some tuning no doubt.

I want to give a shout-out to both the LuCI mwan3 UI and the Wiki docs. They’re both quite understandable and nicely designed.

Update: failover didn’t work last night. Around 4am Starlink got a firmware update and rebooted itself. My monitoring tells me my house couldn’t ping 8.8.8.8 for about 4 minutes afterwards. Looking at the syslog it’s clear mwan3 was aware of the outage; it detects that the network lost carrier and talks about setting the link down, then up again in a few seconds. There’s talk about disabling the interface, etc. But there’s no logging about failed pings (which I would expect) and judging by my monitoring, it did not fail over to the backup. There’s also several log lines with errors in them like “netifd: WAN1 (10264): Command failed: Permission denied” which does not inspire confidence.

OpenWRT on Raspberry Pi 4

I’m fed up with my Ubiquiti routers. UnifiOS has a bunch of bugs related to WAN failover that have been there for a year+. Also a lot of mysterious behavior and some odd engineering problems. And still no official solution for Wireguard last I looked. So I want to switch (back) to OpenWRT for routing. I want three ethernet ports in the end; LAN, WAN, and backup WAN. I’m not using the WiFi on the RPi4 at all. I could, but I have hardware with better antennas.

These notes are a work in progress and are not a fully followable cookbook. It took me like 5 hours of tinkering in different orders to get this going. Don’t worry, it’s not normally that hard. Here are some other sources I found helpful for doing this: one, two, three.

Step 1: Hardware and device drivers

OpenWRT runs on all sorts of hardware but I’m going to do this with a Raspberry Pi 4. It has a way more powerful CPU and way more RAM than a typical router. The networking is a bit weaker though. The RPi4 gives you one ethernet port on the PCIe bus and the rest have to be USB. The TP-Link UE300 is a commonly recommended USB adapter; it has an RTL8153 which is well supported. My LAN will be on the on-board ethernet, the WAN links on USB. I think this setup is good enough to support gigabit throughput and is certainly good enough for Starlink’s max 200 Mbps. I’m told you can do traffic shaping and VPN stuff with very good speeds using the RPi4 CPU, but I haven’t tested it. (Few consumer routers can do gigabit Internet with those extras enabled.)

Another interesting hardware option is an RPi CM4 with a router board. That gives you two full Gigabit ethernet ports connected to the PCIe bus. A third option is to only have a single ethernet port and use VLAN tagging so the router can work in a “router on a stick” mode.

Step 2: Basic install

OpenWRT has pretty good docs for running on an RPi 4, I just followed them to get a basic system going. Note that the RPi 4 is way more powerful hardware than OpenWRT is normally aimed at, so some of the extreme things that OpenWRT does to save resources may not be necessary. Some folks recommend using ext4 as the filesystem instead of the compressed SquashFS. There’s also something about resizing the partition to use all the SD card storage. I didn’t bother.

Flash your SD card, then boot the router without the USB ethernet devices plugged in. (Or with, it doesn’t matter, but don’t plug in any WAN cables yet.) Plug a laptop ethernet into the on-board ethernet port. OpenWRT should come up and be serving DHCP on the ethernet. Point your laptop to http://192.168.1.1/ and you’re logged in to LuCI, the OpenWRT web interface. Note that OpenWRT will call this ethernet port eth0 and it will be bridged into br-lan.

Step 3: enabling your USB ethernet device

There’s one hard thing about setting this up: OpenWRT doesn’t have the kernel drivers for USB ethernet installed by default. You need the kmod-mii, kmod-usb-net, and kmod-usb-net-rtl8152 opkg packages (IPK files) installed. No configuration needed, but you do have to install the packages. The problem is you’re not connected to the Internet to just install them.

You have several options for getting them installed. My choice was to manually download the needed packages from the OpenWRT repository to my laptop, then upload them via LuCI to the router. (You have to install them one at a time and click “Dismiss” manually on the install popup.) It’s also possible to plug a WAN cable into the one working ethernet and get access to the router somehow (keyboard and screen? routing tricks?) to install them. I like the “2nd way” in the linked Reddit post; you make a custom OpenWRT image that includes the necessary drivers.
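For the record, the manual route is only a few steps; a sketch (the download URLs are release- and target-specific so I’m not spelling them out):

```shell
# On the laptop: fetch the three .ipk files for your exact OpenWRT
# release/target from downloads.openwrt.org. Then either upload them
# one at a time via LuCI (System / Software / Upload Package...), or
# scp them to the router and install in dependency order:
opkg install /tmp/kmod-mii_*.ipk
opkg install /tmp/kmod-usb-net_*.ipk
opkg install /tmp/kmod-usb-net-rtl8152_*.ipk
```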

Step 4: configuring the WAN

Now that there’s an ethernet device for the WAN you have to configure it. Do this in LuCI; go to Network / Interfaces and add an interface for eth1. Call it WAN and then go to firewall settings and put it in the WAN zone. This is a little mysterious but I think that’s all you have to do. I rebooted somewhere in here just to be sure that it was all working right.
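In config-file terms that one LuCI step is small. A sketch of the result (option names as of OpenWRT 21.02; older releases use option ifname instead of device, and the zone stanza shown is the stock one from /etc/config/firewall with the new interface added):

```
# /etc/config/network (excerpt)
config interface 'WAN'
    option device 'eth1'
    option proto 'dhcp'

# /etc/config/firewall (excerpt)
config zone
    option name 'wan'
    option masq '1'
    option mtu_fix '1'
    list network 'WAN'
```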

Step 5: Improve OpenWRT

The blessing (and curse) of OpenWRT is it’s a very flexible Linux system and you can do a lot with it. A search for packages matching “luci-app-*” is a good way to see what’s available. Here’s some of what I’ve done.

Update the software lists and manually update each package one at a time. The updates aren’t exactly recommended but before I did that some new LuCI stuff I installed broke the old LuCI install until I updated it.

Install some quality of life packages: less, bash, curl, nano, mg (for me).

Install luci-app-statistics. You also have to enable this by going to Statistics / Setup and saving the config.

TODO / options

Some things I have not yet done but may get around to (and update this blog post if so)

Install mwan3 for failover. Done, see this post for my notes.

Install wireguard for a VPN. Done, see this post for my notes.

See if I can export monitoring stats somehow to my InfluxDB instance for plotting in Grafana.

Install luci-app-sqm for traffic shaping

My own notes

Some notes of interest to me only.

The network interfaces are:
LAN eth0/br-lan E4:5F:01:5F:AB:D8
WAN1 eth1 5C:A6:E6:AA:BC:9A
Wireless not enabled

System / logging / external system log server to my Linux server on the LAN

DHCP range: 192.168.3.2 – 192.168.3.199

Override DNS servers and use 8.8.8.8 and 8.8.4.4

Set a static route for Starlink for 192.168.100.1 / 255.255.255.255 on eth1.

Some static leases (all have a lease time of 86400)
gvl 94:c6:91:1e:c8:38: .75
printer 84:25:19:0e:7b:0b: .67
pvspi0 e4:5f:01:78:d7:f9: .140

Starlink DHCP vs UnifiOS

My Starlink connection will work for days, then get in a mode where every 5 minutes the connection will fail for 30 seconds before it comes back. Long story short it looks like Starlink has a problem where it’s not responding to DHCP address renewals sometimes.

Complicating matters; this is a Ubiquiti USG-3P router running UnifiOS and I have a backup WAN configured. Ubiquiti’s backup WAN stuff is a bit flaky, and also the failover does tend to both hide connection problems (when it fails over) and cause some (the failover itself breaks existing sessions.)

I’ve opened a support ticket for this, TIK-213261-55007-82. And I got two replies!

Apr 12: our engineering team is aware of cases where customers are having WAN DHCP renewal issues and the 5 minute lease expires. Our DHCP server engineers are actively working on this and I did send a link to the blog post to the lead engineer, appreciate the write up. Apologies for the issues and please bear with us.

Apr 15: our network software team deployed two fixes last night to hopefully resolve the DHCP renew issue

I finally got some DHCP logging set up to understand what’s going on at the router when the problem happens. Details below, mostly for my own troubleshooting.

The USG-3P has three ethernet ports. They are

  • Starlink, the primary WAN: on eth0, typically at an IP like 100.74.162.37.
  • My LAN: on eth1, 192.168.3.1
  • SmarterBroadband, the backup WAN2: on eth2, 10.33.1.1

My understanding of the DHCP behavior is that when connected, Starlink gives out an IP address in the CGNAT block 100.64.0.0/10 with an expiration of 5 minutes. Ubiquiti’s DHCP client accepts this address and binds to it. It also renews it every 2.5 minutes, half the lease time. If no renewal succeeds before the 5 minute lease runs out then the link is effectively dead, packets to it fail, and the WAN failover process starts. (DHCP clients are not allowed to keep using an address past the lease expiration!) I’m not certain how the DHCP request is answered by Starlink, but my guess is the DHCP request goes up to the satellite and down to a ground station in Seattle where Starlink answers it. My working theory is those DHCP requests aren’t getting answered.
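That renew-at-half-the-lease behavior is standard DHCP client timing (RFC 2131): renew via unicast at T1 = half the lease, fall back to broadcast rebinding at T2 = 7/8 of it, and give the address up entirely at expiry (servers can override T1/T2, but that doesn’t appear to be happening here). A quick sanity check of those numbers for a 300 second lease:

```shell
lease=300                 # Starlink's 5 minute lease, in seconds
t1=$((lease / 2))         # T1: start unicast DHCPREQUEST renewals
t2=$((lease * 7 / 8))     # T2: fall back to broadcast DHCPREQUEST
echo "renew at ${t1}s, rebind at ${t2}s, expire at ${lease}s"
```

That prints renew at 150s, rebind at 262s, expire at 300s, which lines up with the logs below: unicast renewals start about 2.5 minutes after binding, and the client switches to broadcast as the lease runs out.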

Update: there are reports that Starlink is starting to serve 1 hour DHCP leases. (Later they say it went back to 5 minutes; maybe a temporary change?)

Observed

When things are going well, every 2-3 minutes there’s an entry like this in the log:

2022-04-09T20:30:07,108 <30>Apr  9 13:30:07 ubnt dhclient: bound to 100.74.162.37 -- renewal in 138 seconds.
2022-04-09T20:32:25,233 <30>Apr  9 13:32:25 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-09T20:32:25,292 <30>Apr  9 13:32:25 ubnt dhclient: DHCPACK from 100.64.0.1
2022-04-09T20:32:29,566 <30>Apr  9 13:32:29 ubnt dhclient: bound to 100.74.162.37 -- renewal in 142 seconds.

I believe this is an ordinary DHCP renewal process. The router has an address so it just sends a DHCPREQUEST to request the address again. The server answers back with a DHCPACK and everyone continues along.

When things are going badly, it looks like this:

2022-04-12T15:22:11,377 <30>Apr 12 08:22:11 ubnt dhclient: bound to 100.74.162.37 -- renewal in 117 seconds.
2022-04-12T15:24:08,373 <30>Apr 12 08:24:08 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:19,743 <30>Apr 12 08:24:12 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:20,862 <30>Apr 12 08:24:20 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:39,539 <30>Apr 12 08:24:27 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:47,472 <30>Apr 12 08:24:47 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:59,338 <30>Apr 12 08:24:57 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:07,502 <30>Apr 12 08:25:07 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:23,323 <30>Apr 12 08:25:23 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:43,693 <30>Apr 12 08:25:43 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:57,522 <30>Apr 12 08:25:57 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:26:12,669 <30>Apr 12 08:26:07 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:26:17,713 <30>Apr 12 08:26:17 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:26:28,982 <30>Apr 12 08:26:28 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:26:39,022 <30>Apr 12 08:26:39 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:26:56,422 <30>Apr 12 08:26:56 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:27:10,770 <30>Apr 12 08:27:10 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
2022-04-12T15:27:10,835 <30>Apr 12 08:27:10 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:27:10,835 <30>Apr 12 08:27:10 ubnt dhclient: DHCPOFFER from 100.64.0.1
2022-04-12T15:27:10,910 <30>Apr 12 08:27:10 ubnt dhclient: DHCPACK from 100.64.0.1
2022-04-12T15:27:16,963 <30>Apr 12 08:27:16 ubnt dhclient: bound to 100.74.162.37 -- renewal in 134 seconds.

That sure looks to me like Starlink is just not answering the DHCP request. An abbreviated timeline:

  • 15:22:11,377: the router thinks it has a valid address. It should have an expiration of 300 seconds and therefore be renewed in 150 seconds. But it takes a while for it to arrive, so the actual renewal is scheduled for 117 seconds. (This delay seems normal.) The router should send a DHCPREQUEST to renew the address at 15:24:08. The address will expire around 15:26:38 or maybe a little earlier.
  • 15:24:08,373: the router sends the first DHCPREQUEST to renew its address. It sends it to 100.64.0.1, presumably the IP address at the other end of the Starlink connection.
  • 15:24:08,373 to 15:26:17,713: the router hasn’t gotten a DHCPACK reply. So it keeps sending requests every 10-20 seconds.
  • 15:26:28,982 to 15:26:56,422: still no reply. Now the router starts sending DHCPREQUEST to 255.255.255.255. I think that’s likely because the 100.74.162.37 lease expired and so the router now has no idea what address is at the other end and it’s starting over with a broadcast.
  • 15:27:10,770: the router sends a DHCPDISCOVER. I’m not certain but I think this means the router has given up on the old address entirely and is starting over as if it has never heard of the network.
  • 15:27:10,835: the router receives a DHCPOFFER from Starlink. We’re back in contact! This took 65ms, or just about one round trip on the satellite link.
  • 15:27:10,835: the router sends a DHCPREQUEST, presumably securing the address it was just offered. (Note the log has the request coming before the offer, but I think the log messages just got swapped. They have the same millisecond timestamp.)
  • 15:27:10,910: the router receives a DHCPACK from Starlink, 75ms after the request. We’re connected again!
  • 15:27:16,963: the router logs that the IP address is bound. Not sure why this takes 6 seconds to announce, but that’s all happening inside UnifiOS.

I should note that in all 6 failures I’ve had this morning (once every 5 minutes), only one DHCPDISCOVER has been needed to get the connection going again. Each time one is sent it gets immediately answered. There’s a different kind of outage I’ve seen on previous days, where the router eventually does a DHCPDISCOVER and gets back the address 192.168.100.100 with a very short 5 second lease. That’s Dishy’s internal DHCP server answering with a local only address, a mode it gets into when it loses its link to Starlink’s ground station entirely. That’s a very old behavior I’ve discussed on this blog before. This every-5-minutes outage is relatively new, a few months old.

Interpretation

My interpretation of this is Starlink’s DHCP server has forgotten we had a lease. My router keeps trying to renew the lease with DHCPREQUEST and gets no answer. Finally the router starts over with a DHCPDISCOVER and Starlink responds immediately. That’s consistent with the satellite link basically working but Starlink’s DHCP server failing to renew my address. But I’m guessing here.

What I do know is it’s really annoying having your connection drop every 5 minutes. The WAN failover is kind of making things worse; it takes UnifiOS about 30 seconds to decide to failover. By the time it’s done that Starlink has come back up and so the router switches back to it. Basically every 5 minutes the network is down for about 30 seconds. Any TCP sessions with activity during the failover period get reset.

But the failover isn’t causing the problem, the lack of DHCP response is the real problem.

Updates / notes

A useful command for checking the logs for Starlink-related DHCP stuff

grep dhclient 192.168.3.1.log | grep -v eth2 | fgrep -v 10.33.1.1

Just saw a bizarre failure that I think must be a UnifiOS bug.

2022-04-12T23:54:16,623 <30>Apr 12 16:54:16 ubnt dhclient: bound to 100.74.162.37 -- renewal in 117 seconds.
2022-04-12T23:55:40,224 <86>Apr 12 16:55:40 ubnt sudo:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/sbin/dhclient -q -cf /var/run/dhclient_eth0.conf -pf /var/run/dhclient_eth0.pid -lf /var/run/dhclient_eth0.leases -r eth0
2022-04-12T23:55:40,292 <30>Apr 12 16:55:40 ubnt dhclient: DHCPRELEASE on eth0 to 100.64.0.1 port 67
2022-04-12T23:55:44,280 <86>Apr 12 16:55:44 ubnt sudo:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/rm -f /var/run/dhclient_eth0.pid
2022-04-12T23:55:48,006 <86>Apr 12 16:55:47 ubnt sudo:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/dhclient -q -nw -cf /var/run/dhclient_eth0.conf -pf /var/run/dhclient_eth0.pid  -lf /var/run/dhclient_eth0.leases eth0  2> /dev/null &
2022-04-12T23:55:48,213 <30>Apr 12 16:55:48 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
2022-04-12T23:55:48,242 <30>Apr 12 16:55:48 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T23:55:48,242 <30>Apr 12 16:55:48 ubnt dhclient: DHCPOFFER from 100.64.0.1
2022-04-12T23:55:55,691 <30>Apr 12 16:55:55 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T23:56:04,640 <30>Apr 12 16:56:04 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7
2022-04-12T23:56:11,101 <30>Apr 12 16:56:11 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 13
2022-04-12T23:56:11,164 <30>Apr 12 16:56:11 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T23:56:11,164 <30>Apr 12 16:56:11 ubnt dhclient: DHCPOFFER from 100.64.0.1
2022-04-12T23:56:11,219 <30>Apr 12 16:56:11 ubnt dhclient: DHCPACK from 100.64.0.1
2022-04-12T23:56:17,169 <30>Apr 12 16:56:17 ubnt dhclient: bound to 100.74.162.37 -- renewal in 134 seconds.

This looks like the router just decided to restart dhclient. And the first thing it did was a DHCPRELEASE, breaking my connection. I have no idea why the router did this but it looks like Starlink responded quickly to the requests the router was making.

Measuring ISP congestion without load tests

My Starlink seems congested in the evenings. What are some strategies for measuring the congestion? The simple one is to run a speed test, see below for details on that. But speed tests are wasteful. Is there some lower impact monitoring I can do? This post is a bunch of ideas and questions, no solution.

Pings

My first thought was just to measure ping latency and packet loss; if the ISP is congested then latencies go up (queueing) and packets start getting dropped. Pings are a lot less traffic than load tests and there’s a lot of software to do them. This sort of works but it’s pretty noisy and pretty far removed from what I really care about, which is “can I use the Internet to do stuff?”

It might be that smarter ping / packet-loss measurements could work. My existing ping is already a little clever: every 15 seconds Telegraf sends 5 ICMP pings at once and reports what percentage were lost along with a latency distribution. Packet Loss Test has an interesting take on measuring latency and jitter with WebRTC (in the browser!). iperf's bursts of UDP packets are also a good way of measuring instantaneous throughput and could be done without using too much bandwidth. Ping probing is an idea worth exploring.
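A minimal version of that burst probe, using the stock Linux `ping` utility rather than Telegraf (the target host is just an example; the parsing assumes iputils-style summary output):

```python
import re
import subprocess

def parse_ping_summary(out):
    """Extract (loss percentage, average RTT ms) from ping's summary text."""
    sent, recv = map(int, re.search(
        r"(\d+) packets transmitted, (\d+) (?:packets )?received", out).groups())
    loss_pct = 100.0 * (sent - recv) / sent
    # e.g. "rtt min/avg/max/mdev = 30.1/45.2/60.3/10.4 ms"
    m = re.search(r"min/avg/max[^=]*= ([\d.]+)/([\d.]+)/([\d.]+)", out)
    avg_ms = float(m.group(2)) if m else None
    return loss_pct, avg_ms

def ping_burst(host="1.1.1.1", count=5):
    """Send a burst of ICMP pings and report loss % and average latency."""
    out = subprocess.run(["ping", "-c", str(count), "-q", host],
                         capture_output=True, text=True).stdout
    return parse_ping_summary(out)
```

Logged every 15 seconds, rising loss and average RTT in the evenings would be the congestion signature, noisy as it is.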

A related idea I wrote up is using some sort of traceroute-like probing to not only detect congestion but guess where it’s happening in the network. I think this is probably impractical in the general case, but maybe a custom solution for Starlink’s infrastructure could be helpful.

Passive monitoring

My next thought was to measure TCP retransmits at the router for existing traffic as it goes by. Retransmits are a primary signal of congestion and a major cause of low effective bandwidth. This should be doable in theory with stateful packet inspection in the router but I've never seen any software that does it. Anyone know about it? I also think the resulting data might be pretty indirect.

The other solution would be to measure TCP retransmits at the client endpoint. This is easy in Linux, probably not hard in Windows, but probably impossible on, say, my Roku streaming video device. It sounds like a significant hassle.
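On Linux the easy part really is easy: the kernel keeps cumulative TCP counters, including RetransSegs, in /proc/net/snmp. A sketch of sampling the retransmit rate from them (the 15-second interval is arbitrary):

```python
import time

def tcp_counters(path="/proc/net/snmp"):
    """Return the kernel's cumulative Tcp: counters as a name->int dict."""
    with open(path) as f:
        tcp = [line.split()[1:] for line in f if line.startswith("Tcp:")]
    names, values = tcp[0], tcp[1]  # first Tcp: line is names, second is values
    return dict(zip(names, map(int, values)))

def retrans_pct(interval=15):
    """Percentage of TCP segments retransmitted over `interval` seconds."""
    before = tcp_counters()
    time.sleep(interval)
    after = tcp_counters()
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return 100.0 * retrans / max(sent, 1)
```

That only covers the one Linux box, of course; the Roku and friends stay invisible, which is the real limitation.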

Any passive monitoring has a problem that you aren’t really sure if the congestion is on your client end or some problem on the server end. Load testing services presumably solve this by being well provisioned. But I suspect in practice most of my TCP traffic is going to very well connected sites like Youtube or Cloudflare and any slowness is more likely on my end than theirs.

Load test results

Running load tests is certainly the easiest thing and I’ve been doing that for months. Here’s a week of running an Ookla speedtest every 15-60 minutes:

The average of 130 Mbps is great. But every evening it dips well below 50 Mbps; the worst measured is around 10 Mbps. This testing shows the congestion problem pretty well.

The problem is load tests are expensive: each one downloads 250 megabytes of data over about 20 seconds (when things are fast). When I was running 4 an hour the speed tests were a significant fraction of my overall daily bandwidth usage, maybe 30% or more, and they were making the video streaming worse. I've cut back to once an hour but it's still a waste.
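The arithmetic backs this up. A quick back-of-the-envelope, assuming roughly 250 MB per test:

```python
mb_per_test = 250          # approximate download per Ookla test
tests_per_day = 4 * 24     # one test every 15 minutes
daily_gb = mb_per_test * tests_per_day / 1000
print(daily_gb, "GB/day just for speed tests")  # 24.0 GB/day
```

24 GB a day of pure measurement overhead; if that's ~30% of usage it implies a total around 80 GB/day, which is plausible for a streaming-heavy household.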

Community monitoring

One bad thing about load testing is that it’s selfish; only I get the results of all that wasted bandwidth. Starlink Status tries to do better, by collecting stats from many users and sharing them. It’s a great idea but the execution is a bit lacking.

What would be really great is if Starlink were transparent and published its own traffic and congestion stats. Then I could just look to see how they’re doing. A few independent users could run their own tests to verify they aren’t misleading people. It’d be cool but I’ve never heard of an ISP volunteering this kind of competitive data.

Home LAN disaster mystery

I truly b0rked my home network yesterday. It took more than an hour to fix. I still don’t understand what went wrong but I’m wondering if my ethernet switches got in a bad state.

It started when I tried swapping my Ubiquiti router out for an OpenWRT router. The new router was working OK, the whole network was fine. I installed a few packages to OpenWRT (including wireguard, unconfigured) and rebooted it and then my network was down. No DHCP, couldn’t even connect with a static IP, OK I figured OpenWRT somehow broke and so swapped back to the old Ubiquiti router.

That’s when the mystery started. Nothing worked with the old router plugged back in. Again no DHCP, static IPs not working. Lots of blinking lights on the switches though, suggesting maybe some networking was still happening. I tried power cycling the router and the switches in the path between my computer and the router. No luck.

So then I began a tedious process of getting the network working starting from the router. Unplugged everything but one laptop. Still no DHCP. But static IP worked. That’s when I learned the router’s DHCP server wasn’t running. It was before! Not sure if UnifiOS “lost” the DHCP server setting? The Unifi controller was offline but my understanding is the router should keep its last config, which should have included DHCP service. Weird. So that’s one mystery, why did my router stop doing DHCP?

Fortunately the minimal configuration UI on the router itself does let you turn DHCP on. Turning that on, I had one router working with one laptop. And soon after that, one switch plugged into the router. But the Unifi access point plugged into the switch still didn't seem to be working; connecting via WiFi got no DHCP. Weird.

But it got weirder when I went to the second switch, the big one at the center of my LAN, plugged into the first switch near the router. Nothing plugged into it worked. The ethernet cable feeding that switch worked fine plugged directly into the laptop, but put the switch back in between and boom, no working ethernet. I finally just replaced the switch entirely and suddenly everything started working, including the Unifi access points.

Today I tested the suspect switch and it seems fine. Also the router seems fine. So that’s another mystery, why did this equipment look like it wasn’t working after rebooting a router and works fine today?

One possible explanation for all of this is that somehow my ethernet switching got into a bad state, and that state was somehow polluting the network so nothing was working. I don't think this is a very good theory. The switches are simple unmanaged Netgear GS108s, pretty bulletproof, with no persistent state and very little runtime state. I did try power cycling the switches a few times to no effect.

There is a bit of state in ethernet switches though: the spanning tree protocol. Also some sort of memory of which ethernet devices are plugged in to which port. Could that somehow have gotten corrupted?

My network is pretty simple. There are no physical loops in the network topology (other than those implied by WiFi). I'm not using VLANs at all, although I do note OpenWRT has some default VLAN configuration that I don't understand and didn't alter. There's a total of maybe 5 unmanaged switches and 60 devices.

Honestly I’m just speculating about magical explanations. I don’t really know. I do know I’m scared to try swapping routers again.

Update: I tried the OpenWRT router again and failed, although not as catastrophically. It worked fine on first setup, then I rebooted it to test it and suddenly not only did I not have Internet, but the router wasn't answering anything at all. No idea what's going on. One guess is installing the wireguard packages with opkg somehow caused a problem, even though I didn't configure wireguard. It also looks like Starlink's Dishy terminal doesn't immediately answer DHCP requests the moment the router asks, so that's not great. That doesn't explain why I couldn't even talk to the router at all though.

Google Takeout notes

I finally got my Google Takeout archives downloaded, the most important of my cloud backups. I tend to prefer Google for my services so a lot of stuff is there dating back 20 years. Google Takeout is an excellent product. What’s in the dump?

I chose to download Takeout’s default: everything but Access Log Activity. It came to about 67GiB of zip files, 72GiB uncompressed.

The top level of the dump includes a nice HTML explanation of what’s in it. With per-product documentation for the types of files you’ll find there and what formats they might be in.

Here’s info on some of the meaningful data I found that I was expecting to see.

  • 9GiB: Mail. The biggest and most precious of all my Google data. All stored in a single .mbox file, which is awkward but not an unreasonable choice. The file is not date sorted. The volume per year is confusing: 2012 and 2020 are big at 13,000+ messages, but in 2019 I only had 6,000. Maybe it's spam related. There's also a few extra data files for things like filters, blocked users, etc.
  • 35GiB: Photos. I’m a heavy user of Google photos. Contains both original and edited images, also metadata in a simple JSON format. A little confused at the directory structure; many of the files are in folders like “Photos from 2020” but some are in per-Album directories, I think there are duplicates.
  • 23GiB: Drive. If you asked me I’d tell you I didn’t use Google Drive. I have no idea how this got so big. The useful stuff there is copies of my Google Docs; spreadsheets mostly. The big stuff is a bunch of photos I then imported to Google Photos, I could probably delete them from Drive. It’s a very random and poorly organized collection of stuff.
  • 2.5GiB: YouTube. All the content I’ve created (videos, comments). But also detailed watch and search histories going back 11 years.
  • 0.3GiB: Groups. MBOX format archives of Google Groups I’m an admin for.
  • 0.1GiB: Contacts. VCF format contact lists.
  • 0.1GiB: Calendar. ICS format, single file.
  • 0.3GiB: Location History. JSON files tracking my movements, used for Google Timeline.
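Counting messages per year in that big mbox is easy with Python's standard mailbox module; a sketch (the .mbox filename is a guess, check what Takeout actually names yours):

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime

def messages_per_year(path="All mail Including Spam and Trash.mbox"):
    """Tally messages in an mbox file by the year in their Date: header."""
    years = Counter()
    for msg in mailbox.mbox(path):
        try:
            years[parsedate_to_datetime(msg["Date"]).year] += 1
        except (TypeError, ValueError):
            years["undated"] += 1  # missing or unparseable Date header
    return years
```

Fair warning: mailbox builds an index by scanning the whole file, so on a 9 GiB mbox this takes a while.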

And some of the less interesting or accidental stuff.

  • 0.3GiB: My Activity, Google Pay. The biggest surprise to me; Google records meticulous details on when I use specific products, and there's an online version of the product here. It goes back at least 10 years and includes Android apps, details of what Google Maps views I've looked at, credit card transactions, YouTube video views, and every search query for two years. It's all stored in a generic format that seems to apply across Google products. Also the dump is an absolutely terrible HTML format, with something like 4KB of styled HTML per record. Example of a record for an Android app launch:
  • 0.1GiB: Maps. Data spread out over several directories. Bookmarked places, some KMZ files for custom maps I made.
  • 0.4GiB: Location History. Google’s recorded where my phone has been since I first installed Google Maps. They have a nice history browser for this, I also built my own visualizer product for the data. I really like having it but I think most people would find it surprising and creepy Google keeps this.
  • 0.7GiB: Google Play Games Services, Google Play Store, Android Device Configuration Service, Recorder. Stuff related to my Android phone, including a record of every version of every app I’ve installed and some saved game state.
  • 0.8GiB: Nest, Google Home. Stuff about my thermostat. Including 2 years of detailed temperature readings, etc from my house.
  • 0.6GiB: Blogger. I forgot I ever had a Blogspot blog but Google didn't. Also records of my comments on a lot of other blogs.
  • 0.6GiB: Voice. I have a Google Voice number I basically never use. But it gets spam voicemails, for which sound files and text transcripts have been saved for 7+ years.
  • 0.4GiB: Google Account, Profile. A record of a year of explicit logins (as opposed to passive authentication). Love the inactive account emails I apparently wrote a few years ago: “What a horrible thing, but apparently I’m no longer able to access my Google account which means I’m likely dead or incapacitated.”
  • 0.1GiB: Chrome. I don’t use Chrome much so this is very small. Among other things it contains a history of visited URLs going back 3 months.
  • 0.1GiB: Hangouts. An archive of some GChat messages from 2017?

I’m pretty sanguine about all this data. I want Google to be keeping a lot of data for me and I trust them to be careful caretakers of it. Some of it is incredibly useful; I was really excited when I learned Google Maps had my location history, for instance. Google does a reasonable job letting you control just what you can track and I really appreciate being able to download the data.

The My Activity stuff is the one thing that made me nervous. Partly the awful format is coloring my impression. (There’s several projects on GitHub that parse and analyze it.) But also they’re storing a lot of sensitive data in a generic way that’s not sensitive to the particular app. I don’t really care they have a record of my credit card transactions, but I am a bit nervous at this second record of my Google Maps views or the exact times and places that I launched Grindr. I believe Google keeps this data in this format for security audits.

Data export from cloud services

Goodreads lost all my data, writeup here. This hammered home the value of data export services for making my own copy of stuff. I had an export from last July so the Goodreads loss wasn’t entirely catastrophic. Some notes on other data export services.

Data export is obviously a competitive problem for services: if I can't get my data, I can't easily migrate to a different service, so evil companies want to keep you locked in. Consequently the data export products are often not very good. Countering that are laws like the GDPR and CCPA, which give consumers rights to request their data. I'm finding most of the big services I care about do have a reasonable export service. A good hacker spirit also still pervades some tech companies; they build export tools because they think it's the right thing for users.

Google Takeout is the gold standard. The Data Liberation Front did pioneering work many years ago in convincing a company to provide export tools. The tool now works very well. I'm particularly impressed there's a way to set up a new data export every two months; all the other services are one-offs.

Most of these data export services can be slow. Various nonsense explanations (“for your security”). I cynically expect companies keep them slow to be awkward. But one insider I asked confirmed that they can also be slow for legitimate technical reasons. If you’ve got, say, a real-time messaging system then scraping back for 15 years of message history can be pretty hard on the datastore. Backup requests are seldom time critical so built-in throttles make some sense.

It’s a big problem to actually do anything useful with the data dump once you get it. Some of the services (Twitter) include little webapps to at least sorta browse the data. Some just give plain CSV or JSON dumps and you’re on your own. I was hoping to find a diversity of wonderful open source software that could consume these exports. But in retrospect of course not; building those is a lot of damn work.

There’s less organized community around data exports than I’d hoped to find. The IndieWeb world is a good starting point. Datasette is one general purpose data viewer; Dogsheep is a nice collection of tools to import various formats into SQLite for Datasette. HPI is another interesting general purpose viewer toolset. Also FreeYourStuff which seems to include some scrapers for sites without export tools.

Some of the exports reveal a surprising amount of extra data I didn’t know the sites had. Feedly has a list of everything I’ve read with dates, for instance, which is kinda neat! Facebook has a huge amount of stuff, some alarming, it would take awhile to comb through what all they’ve collected on me. “apps_and_websites_off_of_facebook” seems to be surveillance capitalism in action.

A thread I need to follow up on is what the IndieWeb kids call POSSE: Publish (on your) Own Site, Syndicate Elsewhere. Make it so you own a copy of all the data as you create it. It makes a lot of sense, Cory Doctorow has written about how he does that. It’s a lot of work.

I spent some time today pulling backups for data of all the sites I could think to check. I’ll update this list as I discover more. It’d be awesome to automate refreshing these backups every month or three but given all the security and performance issues that seems difficult.

2856    ./pinboard
56696   ./23andme
601960  ./twitter
4524    ./metafilter
5772    ./feedly
744212  ./facebook
316     ./goodreads
24      ./letterboxd
1356    ./wordpress
8       ./reddit
4       ./google
8       ./yelp
8       ./amazon