OpenWRT + RPi4: failover and load balancing

I have two ISPs. I want to use Starlink all the time unless it goes down, in which case I want to fail over to my fixed wireless. OpenWRT supports this with the MWAN3 package, which also does load balancing and various forms of traffic-based routing. It’s flexible and a little complicated. The whole reason I am ditching my Ubiquiti router is the long-standing bugs in its failover implementation; I’m hoping OpenWRT’s is more reliable.

The MWAN3 docs are excellent but do assume some networking knowledge. Long story short, it’s a set of scripts that implements load balancing and failover by manipulating Linux policy routing. Basically it pings the interfaces and adjusts the routing table appropriately if a connection goes down.

See also my previous post on OpenWRT + RPi4 with a single WAN.

Hardware and device names

I’m using a Raspberry Pi 4 as my router. I’ve purchased two TP-Link UE300 USB ethernet adapters for my WAN ports. They are plugged in to the two blue USB-3 ports. This means I’m going to have two actual hardware ethernet devices for the WAN, eth1 and eth2. (eth0 is the LAN which is the RPi’s on-board ethernet.) Some of the MWAN docs talk about VLANs and switches and devices named like eth1.0; no VLAN needed with the USB hardware.

Unfortunately, as far as I can tell OpenWRT has no built-in provision for naming devices consistently. When I was plugging and unplugging things I would see one of my adapters go from being “eth1” to being “eth2” despite it always being plugged into the same port and the MAC address not changing. This is not good; it makes it hard to tell which interface to use for what. However, I think as long as you don’t hotplug anything the device names are assigned consistently in order. If you look in dmesg there are messages like these:

usb 2-1: New USB device found, idVendor=2357, idProduct=0601, bcdDevice=30.00
r8152 2-1:1.0 eth1: v1.10.11

USB 2-1 seems to be the name for the upper blue USB port. With two dongles plugged in, USB 2-1 seems to always get named eth1 and USB 2-2 always seems to get named eth2. All bets are off if you unplug and replug things afterwards though.

A better solution would be to name devices by the MAC address or some other (relatively) immutable property of the hardware. I found some notes on doing this online here, here, and here. That last one refers to the OpenWRT “Community Build” for RPi4 but that project looks a bit messy and there’s some drama going on with it so I haven’t tried it.

Anyway, after 5 reboots the device names have been stable so I’m going to stick with this. For my notes:
Starlink, usb 2-1, eth1, WAN1, 5C:A6:E6:AA:BC:9A
SmarterBroadband, usb 2-2, eth2, WAN2, 54:AF:97:5B:F9:63
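
To re-check that mapping after a reboot without digging through dmesg, something like the following might help. It is a minimal sketch that reads the standard Linux sysfs entries for each eth* device; the function name and its optional argument are made up for illustration, and it has not been tried on every OpenWRT build.

```shell
#!/bin/sh
# Print "name MAC sysfs-path" for each eth* device.
# list_net_devices is a hypothetical helper name; the optional argument
# (an alternate sysfs root) exists only so the function can be tested.
list_net_devices() {
    root="${1:-/sys/class/net}"
    for dev in "$root"/eth*; do
        # Skip anything that is not a real network device entry.
        [ -r "$dev/address" ] || continue
        name=$(basename "$dev")
        mac=$(cat "$dev/address")
        # For USB adapters the resolved path includes the port, e.g. .../2-1/2-1:1.0
        path=$(readlink -f "$dev/device" 2>/dev/null)
        echo "$name $mac $path"
    done
}

list_net_devices
```

If the MAC next to eth1 is not the one in the notes above, the names have swapped and the WAN1/WAN2 assignments need another look.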

One last note on naming: OpenWRT has a concept of “interface name” in Network / Interfaces, a step of indirection from device names. I.e., I have eth1 named “WAN1” in OpenWRT and a lot of other OpenWRT software uses WAN1 as the name. I ran into a problem because I named my interfaces WAN1/WAN2. The mwan3 scripts default to assuming these names are “wan” and “wanb”, and I regret not using those names for simplicity.

Setting up the second ISP

Before you do anything with mwan3 you have to get the second ISP working in OpenWRT. This is as simple as adding it in Network / Interfaces. The “gateway metric” advanced setting is important; Linux will send packets to the gateway with the lowest metric. If two interfaces have the same metric (which is the default if you don’t edit it) I’m not sure what happens; it looks like only one of them gets used unless mwan3 or some other load balancing configuration is in place.
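
For reference, here is roughly what the two WAN interfaces could look like in /etc/config/network once the metrics are set. This is a sketch, not a copy of a working config: the dhcp protocol and the metric values 10 and 20 are assumptions, and older OpenWRT releases use “option ifname” where newer ones use “option device”.

```
# /etc/config/network (fragment)
config interface 'WAN1'
    option device 'eth1'    # Starlink
    option proto 'dhcp'     # assumption: the ISP hands out an address via DHCP
    option metric '10'      # lower metric, so this is the preferred gateway

config interface 'WAN2'
    option device 'eth2'    # fixed wireless
    option proto 'dhcp'
    option metric '20'      # higher metric, so this is the backup
```

The only thing that matters for the metrics is that they differ and that the preferred link has the lower number.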

If you’re using luci-app-statistics this is a good time to go to Statistics / Setup and make sure you have graphs turned on for all of your network interfaces.

Installing and using MWAN3

This is as simple as opkg install mwan3 luci-app-mwan3 (or the LuCI equivalent). However, the moment you install mwan3 it is going to start running and doing routing. The default config is reasonable; it tries to load balance on “wan” and “wanb”. But because I used “wan1” and “wan2” as my names instead, nothing worked. Ironically I hadn’t even installed the LuCI interface yet, so without it I had to edit config files by hand.

The config file is in /etc/config/mwan3. There is also a “mwan3” command line tool that is very helpful when tinkering.
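
The subcommands most useful when tinkering are the read-only ones. A small sketch that wraps them, where mwan3_overview is a made-up name and the guard is only there so the snippet fails politely on a machine that is not the router:

```shell
#!/bin/sh
# mwan3_overview is a hypothetical wrapper; the mwan3 subcommands it calls
# are the tool's own.
mwan3_overview() {
    if ! command -v mwan3 >/dev/null 2>&1; then
        echo "mwan3 not found; run this on the router" >&2
        return 1
    fi
    mwan3 interfaces   # state of each tracked interface
    mwan3 policies     # which members each policy is currently using
    mwan3 rules        # the active traffic rules
}
```

Calling mwan3_overview over ssh prints the three sections in turn; “mwan3 status” on its own prints a similar combined report.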

The LuCI interface is in two places. Network / Load Balancing is where the config is done. Status / Load Balancing is the status page. (Some of the docs refer to an obsolete location for the status page.)

Configuring MWAN3

The docs are great so I won’t try to reproduce them. In summary, though, you describe interfaces, one per ISP connection. Then you define members, which are statements like “wan_m1_w3 uses interface WAN1 with metric 1 and weight 3”. Members are combined into policies like “balanced” (use both wan and wanb) or “wan_wanb” (failover; use wanb only if wan is down). Within a policy, the member that’s currently up and valid and has the lowest metric is the one that gets used; if there are two members with the same metric, traffic is load balanced between them according to the weights. Finally you define a set of rules which say “traffic of this type gets sent out this policy”.

It’s a complicated and flexible system. The default config is to load balance the two connections with slightly unequal weights. I just edited the rules to use the “wan_wanb” policy instead of “balanced” to get strict failover behavior instead. (I also edited it to accommodate my different naming convention, but only for interfaces; I left mwan’s “wan” and “wanb” naming for members, policies, etc.)
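
A trimmed-down sketch of what a strict failover setup along these lines might look like in /etc/config/mwan3. The member, policy, and rule names follow the package’s stock naming; the track_ip addresses are placeholders rather than the ones actually in use, and most of the tracking options are left out for brevity.

```
# /etc/config/mwan3 (fragment)
config interface 'WAN1'
    option enabled '1'
    option family 'ipv4'
    list track_ip '1.1.1.1'      # placeholder ping target
    list track_ip '8.8.8.8'      # placeholder ping target
    option reliability '2'

config interface 'WAN2'
    option enabled '1'
    option family 'ipv4'
    list track_ip '1.0.0.1'      # placeholder ping target
    list track_ip '8.8.4.4'      # placeholder ping target
    option reliability '2'

config member 'wan_m1_w3'
    option interface 'WAN1'
    option metric '1'            # lowest metric wins while the link is up
    option weight '3'

config member 'wanb_m2_w2'
    option interface 'WAN2'
    option metric '2'            # only used when every metric-1 member is down
    option weight '2'

config policy 'wan_wanb'
    list use_member 'wan_m1_w3'
    list use_member 'wanb_m2_w2'

config rule 'default_rule_v4'
    option dest_ip '0.0.0.0/0'
    option family 'ipv4'
    option use_policy 'wan_wanb' # stock config points this at 'balanced'
```

Because the two members have different metrics, the weights never come into play here; they only matter when members share a metric.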

There’s a lot of subtle failover behavior you can configure, particularly in the interfaces section. That’s where you set things like ping intervals or how many pings have to fail before the interface is considered dead. I’ve left all the defaults in place other than changing the IP addresses it pings. The package ships with the docs’ defaults except that reliability is set to 2. I think that means it runs a test every 10 seconds. Each test pings each of 4 hosts once with a 4 second timeout; if 2 of the pings succeed then the test passes. If 5 tests fail in a row the link is considered dead. 5 successes and the link is alive again. There’s a complicated extra set of heuristics for “check_quality” but I believe those are disabled by default.
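
Taking those numbers at face value, a back-of-the-envelope estimate of how long detection takes looks like this. It is only a sketch of the arithmetic: the variable names mirror mwan3’s option names, but the worst case assumes pings are sent one after another and each waits out its full timeout, which may not match what mwan3track really does.

```shell
#!/bin/sh
# Values as described above; names mirror the mwan3 interface options.
interval=10   # seconds between tests
timeout=4     # seconds to wait for each ping
hosts=4       # number of tracked IPs
down=5        # failed tests in a row before the link is marked dead
up=5          # passed tests before the link is marked alive again

# Best case: failed pings return immediately, so only the interval counts.
down_min=$((down * interval))
# Worst case (assumption): every ping to every host waits out its full
# timeout, one after another, before the next interval starts.
down_max=$((down * (interval + hosts * timeout)))
# Recovery: successful pings return quickly, so roughly up * interval.
up_min=$((up * interval))

echo "time to declare the link dead: ${down_min}s to ${down_max}s"
echo "time to declare the link alive again: at least ${up_min}s"
```

So with these settings a dead link would take somewhere between about 50 and 130 seconds to be noticed, which is worth keeping in mind when deciding whether a short dropout should trigger a failover at all.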

There’s also a global option to enable syslogging. Turning that on to “notice” gives a reasonable, non-spammy syslog of failed pings and failover.
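
In the config file this would presumably be a couple of lines in the globals section of /etc/config/mwan3. The option names below are how the mwan3 docs describe them as best I can tell; treat them as an assumption and check against your installed version, since setting it through LuCI avoids guessing.

```
# /etc/config/mwan3 (fragment, added to the existing globals section)
config globals 'globals'
    option logging '1'          # assumption: master switch for logging
    option loglevel 'notice'    # quiet enough to read, still shows failover events
```

The resulting messages land in the normal system log, so “logread | grep mwan3” on the router is enough to follow along.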

Does it work?

Yes! So far so good. It all seems a bit complicated for basic failover but once I got going the configuration was really pretty simple.

One gotcha when testing: if you search Google for “what is my ip” it will helpfully show you your IP address. But this might be a little out of date; in some cases you can still have a browser connection open on the backup WAN interface even though new connections are being made on the primary WAN. I think curl is a more reliable way to verify what your router is choosing at the moment.
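
A minimal sketch of the curl approach, run from any machine on the LAN. Each call opens a fresh connection, so it reflects where new traffic is going right now. The function name is made up, and the default URL is just one of several public services that echo your address back as plain text; any of them should do.

```shell
#!/bin/sh
# current_wan_ip is a hypothetical helper. The default URL is an assumption;
# pass a different plain-text "what is my IP" service as the first argument.
current_wan_ip() {
    url="${1:-https://ifconfig.me}"
    # -s hides the progress meter; --max-time keeps a dead link from hanging.
    curl -s --max-time 5 "$url"
}
```

Running current_wan_ip before and after pulling the primary WAN cable should show the address switch from one ISP to the other once mwan3 has marked the link dead.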

The behavior I observed while testing is a bit more complicated than I expected. The syslog messages from mwan3track talk about a “score” that’s getting incremented faster than the 10 second interval I thought it had.

Basic failover seems to be functioning. I’ll know more in a day or two. I’m curious how well the default heuristic is going to work with Starlink’s “drop out for a few seconds” failure mode as well as the nightly congestion. The worst outcome would be the router flapping over to failover mode too often; sometimes it’s better to wait out a transient failure. Will need some tuning no doubt.

I want to give a shout-out to both the LuCI mwan3 UI and the Wiki docs. They’re both quite understandable and nicely designed.

Update: failover didn’t work last night. Around 4am Starlink got a firmware update and rebooted itself. My monitoring tells me my house couldn’t ping for about 4 minutes afterwards. Looking at the syslog it’s clear mwan3 was aware of the outage; it detects that the network lost carrier and talks about setting the link down, then up again in a few seconds. There’s talk about disabling the interface, etc. But there’s no logging about failed pings (which I would expect) and, judging by my monitoring, it did not fail over to the backup. There are also several log lines with errors in them like “netifd: WAN1 (10264): Command failed: Permission denied”, which does not inspire confidence.