Ubiquiti Security Gateway failures

Trying to diagnose a problem with my Grass Valley network where the Internet goes down for 20–120 seconds every few hours. I think this is an old problem that dates back to my old router (a Ubiquiti EdgeMAX), but it may be related to the new USG or the LTE failover device I recently installed, even though that's now disabled.

The symptom is the whole Internet goes down. My router logs this message:

WAN iface [eth0] transition to state [active] 11:33 am 2020-10-05
WAN iface [eth0] transition to state [inactive] 11:32 am 2020-10-05

I’m able to confirm the Internet really is down with a simple looping ping script. But that just tells me the Internet is down, not why. My question is: why is the router logging that message?
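For reference, a sketch of that kind of monitor; not my exact script, but the same idea (the target host and 5 second interval here are placeholders):

#!/bin/sh
# Looping ping monitor: ping a target every 5 seconds, log each
# failure, and report how long each outage lasted once it recovers.
TARGET="ping.ubnt.com"   # or the antenna's IP for the local monitor
INTERVAL=5
count=0
last_ok=$(date +%s)
down=0
while true; do
  count=$((count + 1))
  if ping -c 1 -W 2 "$TARGET" > /dev/null 2>&1; then
    now=$(date +%s)
    if [ "$down" -eq 1 ]; then
      echo "Outage for $((now - last_ok)) seconds started $(date -d @"$last_ok")"
      down=0
    fi
    last_ok=$now
  else
    echo "Ping $count failed $(date)"
    down=1
  fi
  sleep "$INTERVAL"
done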

Related: this webpage has excellent docs on how Ubiquiti’s failure detection and failover work.

Theories

My first theory was that my WISP is just flaky and that’s to be expected. I swear I’ve had problems like this for years. But I had a long chat with them yesterday and they assured me that’s not expected behavior, and that my device has had a clear, strong signal for at least a week. Then again, their monitoring only has 5 minute granularity, so it wouldn’t catch a 30 second outage.

My second theory is that my WAN device, a Cambium ePMP 1000, is genuinely failing: the device loses power or reboots for some other reason. Unfortunately it’s up in a tree and I don’t have access to read any stats or SNMP data from it, so I don’t know. But that would certainly explain why the router decided eth0 was inactive, if it lost ethernet link.

The third theory is that it’s Ubiquiti’s watchdog itself that’s failing. If you configure a WAN2 failover (say, my LTE device), the USG runs a watchdog process that pings ping.ubnt.com every 10 seconds. If that fails 3 times in a row, it decides the WAN1 interface is down, marks it inactive, and fails over to WAN2.
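In shell terms, my mental model of that algorithm is roughly this (a sketch of the behavior, not Ubiquiti’s actual wlb code):

# Watchdog sketch: one ping every 10 seconds; 3 consecutive
# failures mark WAN1 inactive and trigger failover to WAN2.
fails=0
while true; do
  if ping -c 1 -W 2 ping.ubnt.com > /dev/null 2>&1; then
    fails=0
  else
    fails=$((fails + 1))
  fi
  if [ "$fails" -ge 3 ]; then
    echo "WAN1 reachability failed: marking eth0 inactive, failing over to WAN2"
  fi
  sleep 10
done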

Evidence

But would they still be running this failover algorithm even if WAN2 is disabled and not running? I fear so. Logging in as root on the USG and running show load-balance watchdog, I see evidence that the failover machinery is still active.

Group wan_failover
  eth0
  status: Running
  pings: 2457
  fails: 0
  run fails: 0/3
  route drops: 7
  ping gateway: ping.ubnt.com - REACHABLE
  last route drop   : Tue Oct 6 02:53:08 2020
  last route recover: Tue Oct 6 02:53:59 2020

That timestamp corresponds to my latest failure and suggests the watchdog apparatus is still running. show load-balance status also has some interesting info in it.

There’s more information in /var/log/messages on the router:

Oct 6 02:52:38 ubnt zebra[753]: interface eth0 index 2 changed.
Oct 6 02:52:38 ubnt kernel: eth0: Link down
Oct 6 02:52:39 ubnt zebra[753]: interface eth0 index 2 changed.
Oct 6 02:52:39 ubnt kernel: eth0: 100 Mbps Full duplex, port 0
Oct 6 02:53:08 ubnt wlb: wlb-wan_failover-eth0 reachability failed, failover
Oct 6 02:53:08 ubnt wlb: group wan_failover, interface eth0 going Inactive
Oct 6 02:53:08 ubnt wan-event: [WAN Transition] eth0 to state inactive
Oct 6 02:53:59 ubnt wlb: group wan_failover, interface eth0 going Active
Oct 6 02:53:59 ubnt wan-event: [WAN Transition] eth0 to state active

In this log the ethernet interface goes down first, before the watchdog is involved. That’s consistent with theory #2: the connection to the WISP antenna is flaking out. Looking back in the logs, the same pattern shows up for many failures. The ethernet link goes down for a second or two, then comes back. About 30 seconds later the failover system kicks in and marks the WAN inactive. Then some period of time later (51 seconds in this instance) the failover marks the link active again.

My local ping to the WISP antenna shows a shorter outage. Both clocks are NTP synchronized; the timestamps are just 7 hours apart because of timezones.

Ping 35446 failed Tue Oct 6 09:52:42 2020
Ping 35447 failed Tue Oct 6 09:52:47 2020
Ping 35448 failed Tue Oct 6 09:52:52 2020
Ping 35449 failed Tue Oct 6 09:52:57 2020
Outage for 21 seconds started Tue Oct 6 09:52:36 2020

I also have a different monitor pinging ping.ubnt.com from my own Linux box:

Ping 197355 failed Tue Oct 6 09:52:42 2020
Ping 197356 failed Tue Oct 6 09:53:12 2020
Ping 197357 failed Tue Oct 6 09:53:22 2020
Ping 197358 failed Tue Oct 6 09:53:27 2020
Ping 197359 failed Tue Oct 6 09:53:32 2020
Ping 197360 failed Tue Oct 6 09:53:37 2020
Ping 197361 failed Tue Oct 6 09:53:42 2020
Ping 197362 failed Tue Oct 6 09:53:52 2020
Ping 197363 failed Tue Oct 6 09:53:57 2020
Ping 197364 failed Tue Oct 6 09:54:02 2020
Outage for 90 seconds started Tue Oct 6 09:52:32 2020

Putting that all together, the timeline:

  • 02:52:38 to 02:52:39 router notices WISP ethernet is down
  • 02:52:42 to 02:52:57 (approx) can’t ping the antenna IP address
  • 02:53:08 to 02:53:59 router marks WAN interface inactive
  • 02:52:42 to 02:54:02 (approx) can’t ping an Internet host

Note that even though the ethernet link is only down for about a second, the whole Internet is down for 90 seconds. But the WISP antenna is only unreachable for about 20 seconds.

My guess is the ethernet connection to the WISP antenna, including its power, failed; or else the WISP antenna rebooted itself. The ethernet link comes back very fast on reboot on that device, within a second or so. But it takes 20 seconds before it’s up and running enough to respond to pings, and in this case 90 seconds to fully recover and be reliably online again. Not all the outages are this long; most are more like 60 seconds.

Next steps

So that’s all consistent with theory #2: the connection to the WISP antenna is flaking out. I mean, it’s right there in the “Link down” message. I’m still confused as to how a momentary break in the ethernet signal turns into a 60 second outage.

I’ve asked my ISP whether they can give me more information about what’s happening with the WISP antenna device. At least its uptime.

I’ve got a pretty marginal wiring setup for the antenna: the ethernet cable is 350+ feet long and the PoE injector is at the far end, away from the antenna. There’s also a simple passive cable joiner about 50 feet from the antenna. I should split the run there, put the PoE injector (and maybe an ethernet switch) nearer the antenna, and see if that makes things more reliable.

Update

Well damn, a theory shot down. My ISP got back to me:

I logged into the antenna, and it is showing an up time of 6 days and 7 hours at the time of this email. I also looked at the signal history for the last 3 days and found no breaks or loss of the signal between the antenna and the radio tower. That antenna has a reboot time of about 1 to 2 minutes.

So much for my theory that the antenna was losing power and rebooting.

However, I’m still convinced that the ethernet link between the router and the antenna is failing; the router says so in its logs! That could still be a result of the long ethernet run through a coupler, so there may still be an advantage to putting a switch in the middle of that run; the problem may never have been PoE power, but rather data / link failure. I also wonder if the WAN1 port on the USG itself could be broken. Seems unlikely, but…

I had no outages at all for nearly 36 hours, then today I’ve had 10 outages in 5 hours (mostly overnight). Really confusing.

Update 2

It’s been bugging me that the local antenna seems to go down for 21 seconds, but then the whole WAN goes down for a minute or more. I think we’ve got an unfortunate coincidence of timings. The watchdog default is (I think) 3 failed pings, sent 10 seconds apart, to declare a failure, and 3 successful pings, also 10 seconds apart, to declare recovery. With my antenna link outages, 3 pings just barely fail in a row. The 4th would work, but the router has already decided the link is down. Then it waits for 3 more good pings before re-activating the link. I can sort of verify this behavior: the antenna once failed to answer pings for 6 seconds (don’t know why), and the router didn’t count that as a real outage.
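Mapping that onto the logged timeline above (the individual ping times are my interpolation; only the state transitions are actually in the logs):

  • 02:52:38 — ethernet link drops, and the watchdog’s pings start failing
  • ~02:52:48, ~02:52:58, 02:53:08 — three failed pings in a row; eth0 is marked inactive
  • ~02:53:30 — the watchdog’s pings to ping.ubnt.com start succeeding again (my Linux box was still seeing intermittent loss until 02:54:02, so the router got a bit lucky here)
  • ~02:53:39, ~02:53:49, 02:53:59 — three successful pings in a row; eth0 is marked active again

So a roughly 20 second link outage turns into a 51 second WAN-inactive window, on top of the 30 seconds of detection lag before it.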

For now I’m testing a config change to require 6 bad pings in a row before declaring WAN failure. Once the LTE device is working again it’ll mean a longer wait before failover kicks in, but it might hide these ethernet link outages, or at least make them less severe.

$ configure
ubnt@ubnt# set load-balance group wan_failover interface eth0 route-test count failure 6
[edit]
ubnt@ubnt# set load-balance group wan_failover interface eth0 route-test count success 3
[edit]
ubnt@ubnt# show load-balance group wan_failover
 flush-on-active enable
 interface eth0 {
     route-test {
         count {
             failure 6
             success 3
         }
         initial-delay 20
         interval 10
     }
 }
 interface tun900 {
     failover-only
     route-test {
         initial-delay 60
         interval 600
     }
 }
 sticky {
     dest-addr enable
     dest-port enable
     source-addr enable
 }
 transition-script /config/scripts/wan-event-report.sh
[edit]
ubnt@ubnt# commit
[edit]
ubnt@ubnt# save
Saving configuration to '/config/config.boot'…
Done
[edit]

These changes aren’t persistent; the next time the controller reprovisions the router they’ll be wiped out. The linked docs also explain how to make them persist through reprovisioning, but I’m told that only works with a USG, not a Dream Machine. Fortunately I have the USG.
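As I understand it, the mechanism those docs describe is a config.gateway.json file placed in the controller’s site directory, mirroring the CLI config tree as JSON. I haven’t set this up yet; a sketch of what I believe it would look like for the settings above:

{
  "load-balance": {
    "group": {
      "wan_failover": {
        "interface": {
          "eth0": {
            "route-test": {
              "count": {
                "failure": "6",
                "success": "3"
              }
            }
          }
        }
      }
    }
  }
}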

Update 3

Those changes may have made things worse; the outages seem to be even longer. Honestly, I can’t really understand what’s happening. Since making those changes:

Outage for 60 seconds started Mon Oct 12 14:19:03 2020
Outage for 90 seconds started Tue Oct 13 16:13:22 2020
Outage for 75 seconds started Tue Oct 13 16:18:27 2020
Outage for 51 seconds started Tue Oct 13 16:29:32 2020
Outage for 75 seconds started Wed Oct 14 08:37:20 2020
Outage for 16 seconds started Wed Oct 14 08:38:35 2020
Outage for 14 seconds started Wed Oct 14 19:08:41 2020
Outage for 55 seconds started Wed Oct 14 21:09:21 2020
Outage for 121 seconds started Thu Oct 15 01:40:31 2020
Outage for 90 seconds started Thu Oct 15 18:25:41 2020
Outage for 90 seconds started Fri Oct 16 19:50:06 2020

I’ve now tried setting the “success” count from 3 consecutive pings (i.e. 30 seconds) down to just one. Not ideal if there’s a working failover, but in the absence of one I think this is better than waiting?
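Same knob as before, via the USG CLI:

ubnt@ubnt# set load-balance group wan_failover interface eth0 route-test count success 1
ubnt@ubnt# commit
ubnt@ubnt# save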