Starlink DHCP vs UnifiOS

My Starlink connection will work for days, then get in a mode where every 5 minutes the connection will fail for 30 seconds before it comes back. Long story short it looks like Starlink has a problem where it’s not responding to DHCP address renewals sometime.

Complicating matters; this is a Ubiquiti USG-3P router running UnifiOS and I have a backup WAN configured. Ubiquiti’s backup WAN stuff is a bit flaky, and also the failover does tend to both hide connection problems (when it fails over) and cause some (the failover itself breaks existing sessions.)

I’ve opened a support ticket for this, TIK-213261-55007-82. And I got two replies!

Apr 12: our engineering team is aware of cases where customers are having WAN DHCP renewal issues and the 5 minute lease expires. Our DHCP server engineers are actively working on this and I did send a link to the blog post to the lead engineer, appreciate the write up. Apologies for the issues and please bear with us.

Apr 15: our network software team deployed two fixes last night to hopefully resolve the DHCP renew issue

I finally got some DHCP logging set up to understand what’s going on at the router when the problem happens. Details below, mostly for my own troubleshooting.

The USG-3P has three ethernet ports. They are

  • Starlink, the primary WAN: on eth0, typically at an IP like 100.74.162.37.
  • My LAN: on eth1, 192.168.3.1
  • SmarterBroadband, the backup WAN2: on eth2, 10.33.1.1

My understanding of the DHCP behavior is that when connected, Starlink gives out an IP address in the CGNAT block 100.64.0.0/10 with an expiration of 5 minutes. Ubiquiti’s DHCP client accepts this address and binds to it. It also renews it every 2.5 minutes, half the lease time. If the renewal doesn’t happen in that 2.5 minutes then the link is effectively dead, packets to it fail, and the WAN failover process starts. (DHCP clients are not allowed to keep using an address past the lease expiration!) I’m not certain how the DHCP request is answered by Starlink, but my guess is the DHCP request goes up to the satellite and down to a ground station in Seattle where Starlink answers it. My working theory is those DHCP requests aren’t getting answered.

Update: there are reports that Starlink is starting to serve 1 hour DHCP leases. (Later they say it went back to 5 minutes; maybe a temporary change?)

Observed

When things are going well, every 2-3 minutes there’s an entry like this in the log:

2022-04-09T20:30:07,108 <30>Apr  9 13:30:07 ubnt dhclient: bound to 100.74.162.37 -- renewal in 138 seconds.
2022-04-09T20:32:25,233 <30>Apr  9 13:32:25 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-09T20:32:25,292 <30>Apr  9 13:32:25 ubnt dhclient: DHCPACK from 100.64.0.1
2022-04-09T20:32:29,566 <30>Apr  9 13:32:29 ubnt dhclient: bound to 100.74.162.37 -- renewal in 142 seconds.

I believe this is an ordinary DHCP renewal process. The router has an address so it just sends a DHCPREQUEST to request the address again. The server answers back with a DHCPACK and everyone continues along.

When things are going badly, it looks like this:

2022-04-12T15:22:11,377 <30>Apr 12 08:22:11 ubnt dhclient: bound to 100.74.162.37 -- renewal in 117 seconds.
2022-04-12T15:24:08,373 <30>Apr 12 08:24:08 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:19,743 <30>Apr 12 08:24:12 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:20,862 <30>Apr 12 08:24:20 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:39,539 <30>Apr 12 08:24:27 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:47,472 <30>Apr 12 08:24:47 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:24:59,338 <30>Apr 12 08:24:57 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:07,502 <30>Apr 12 08:25:07 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:23,323 <30>Apr 12 08:25:23 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:43,693 <30>Apr 12 08:25:43 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:25:57,522 <30>Apr 12 08:25:57 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:26:12,669 <30>Apr 12 08:26:07 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:26:17,713 <30>Apr 12 08:26:17 ubnt dhclient: DHCPREQUEST on eth0 to 100.64.0.1 port 67
2022-04-12T15:26:28,982 <30>Apr 12 08:26:28 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:26:39,022 <30>Apr 12 08:26:39 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:26:56,422 <30>Apr 12 08:26:56 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:27:10,770 <30>Apr 12 08:27:10 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
2022-04-12T15:27:10,835 <30>Apr 12 08:27:10 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T15:27:10,835 <30>Apr 12 08:27:10 ubnt dhclient: DHCPOFFER from 100.64.0.1
2022-04-12T15:27:10,910 <30>Apr 12 08:27:10 ubnt dhclient: DHCPACK from 100.64.0.1
2022-04-12T15:27:16,963 <30>Apr 12 08:27:16 ubnt dhclient: bound to 100.74.162.37 -- renewal in 134 seconds.

That sure looks to me like Starlink is just not answering the DHCP request. An abbreviated timeline:

  • 15:22:11,377: the router thinks it has a valid address. It should have an expiration of 300 seconds and therefore be renewed in 150 seconds. But it takes awhile for it to arrive, so the actual renewal is scheduled for 117 seconds. (This delay seems normal.) The router should send a DHCPREQUEST to renew the address at 15:24:08. The address will expire around 15:26:38 or maybe a little earlier.
  • 15:24:08,373: the router sends the first DHCPREQUEST to renew its address. It sends it to 100.64.0.1, presumably the IP address at the other end of the Starlink connection.
  • 15:24:08,373 to 15:26:17,713: the router hasn’t gotten a DHCPACK reply. So it keeps sending requests every 10-20 seconds.
  • 15:26:28,982 to 15:26:56,422: still no reply. Now the router starts sending DHCPREQUEST to 255.255.255.255. I think that’s likely because the 100.74.162.37 lease expired and so the router now has no idea what address is at the other end and it’s starting over with a broadcast.
  • 15:27:10,770: the router sends a DHCPDISCOVER. I’m not certain but I think this means the router has given up on the old address entirely and is starting over as if it has never heard of the network.
  • 15:27:10,835: the router receives a DHCPOFFER from Starlink. We’re back in contact! This took 65ms, or just about one round trip on the satellite link.
  • 15:27:10,835: the router sends a DHCPREQUEST, presumably securing the address it was just offered. (Note the log has the request coming before the offer, but I think the log messages just got swapped. They have the same millisecond timestamp.)
  • 15:27:10,910: the router receives a DHCPACK from Starlink, 75ms after the request. We’re connected again!
  • 15:27:16,963: the router logs that the IP address is bound. Not sure why this takes 6 seconds to announce, but that’s all happening inside UnifiOS.

I should note that in all 6 failures I’ve had this morning (once every 5 minutes), only one DHCPDISCOVER has been needed to get the connection going again. Each time one is sent it gets immediately answered. There’s a different kind of outage I’ve seen on previous days, where the router eventually does a DHCPDISCOVER and gets back the address 192.168.100.100 with a very short 5 second lease. That’s Dishy’s internal DHCP server answering with a local only address, a mode it gets into when it loses its link to Starlink’s ground station entirely. That’s a very old behavior I’ve discussed on this blog before. This every-5-minutes outage is relatively new, a few months old.

Interpretation

My interpretation of this is Starlink’s DHCP server has forgotten we had a lease. My router keeps trying to renew the lease with DHCPREQUEST and gets no answer. Finally the router starts over wtih a DHCPDISCOVER and Starlink responds immediately. That’s consistent with the satellite link basically working but Starlink’s DHCP server failing to renew my address. But I’m guessing here.

What I do know is it’s really annoying having your connection drop every 5 minutes. The WAN failover is kind of making things worse; it takes UnifiOS about 30 seconds to decide to failover. By the time it’s done that Starlink has come back up and so the router switches back to it. Basically every 5 minutes the network is down for about 30 seconds. Any TCP sessions with activity during the failover period get reset.

But the failover isn’t causing the problem, the lack of DHCP response is the real problem.

Updates / notes

A useful command for checking the logs for Starlink-related DHCP stuff

grep dhclient 192.168.3.1.log | grep -v eth2 | fgrep -v 10.33.1.1

Just saw a bizarre failure that I think must be a Unifi OS bug.

2022-04-12T23:54:16,623 <30>Apr 12 16:54:16 ubnt dhclient: bound to 100.74.162.37 -- renewal in 117 seconds.
2022-04-12T23:55:40,224 <86>Apr 12 16:55:40 ubnt sudo:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/sbin/dhclient -q -cf /var/run/dhclient_eth0.conf -pf /var/run/dhclient_eth0.pid -lf /var/run/dhclient_eth0.leases -r eth0
2022-04-12T23:55:40,292 <30>Apr 12 16:55:40 ubnt dhclient: DHCPRELEASE on eth0 to 100.64.0.1 port 67
2022-04-12T23:55:44,280 <86>Apr 12 16:55:44 ubnt sudo:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/rm -f /var/run/dhclient_eth0.pid
2022-04-12T23:55:48,006 <86>Apr 12 16:55:47 ubnt sudo:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/sh -c /sbin/dhclient -q -nw -cf /var/run/dhclient_eth0.conf -pf /var/run/dhclient_eth0.pid  -lf /var/run/dhclient_eth0.leases eth0  2> /dev/null &
2022-04-12T23:55:48,213 <30>Apr 12 16:55:48 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
2022-04-12T23:55:48,242 <30>Apr 12 16:55:48 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T23:55:48,242 <30>Apr 12 16:55:48 ubnt dhclient: DHCPOFFER from 100.64.0.1
2022-04-12T23:55:55,691 <30>Apr 12 16:55:55 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T23:56:04,640 <30>Apr 12 16:56:04 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7
2022-04-12T23:56:11,101 <30>Apr 12 16:56:11 ubnt dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 13
2022-04-12T23:56:11,164 <30>Apr 12 16:56:11 ubnt dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
2022-04-12T23:56:11,164 <30>Apr 12 16:56:11 ubnt dhclient: DHCPOFFER from 100.64.0.1
2022-04-12T23:56:11,219 <30>Apr 12 16:56:11 ubnt dhclient: DHCPACK from 100.64.0.1
2022-04-12T23:56:17,169 <30>Apr 12 16:56:17 ubnt dhclient: bound to 100.74.162.37 -- renewal in 134 seconds.

This looks like the router just decided to restart dhclient. And the first thing it did was a DHCPRELEASE, breaking my connection. I have no idea why the router did this but it looks like Starlink responded quickly to the requests the router was making.