Debugging a WiFi problem

Still tinkering with Raspberry Pis and my Sunpower PVS6 solar controller. I had it all set up with a new RPi Zero 2 W when I hit a new problem: the WiFi was flaky. It kinda sorta mostly worked well enough; I could usually get HTTP requests through to it. But ping was showing 40% or more packet loss with a latency of 200+ms. I have a very clean WiFi environment, so that should be 0% loss and 2ms latency.

A whole lot of testing later, I established that the problem is some sort of interference on 2.4GHz channel 1, but only in my garage near the Sunpower box. That corner also has all my other Ethernet equipment, is where the generator and solar feeds come in, and is where the utility power and electrical panel are. So a lot of wires. None of that stuff should interfere with 2401-2423MHz, and there are a lot of mysteries still.

Switching my access point to channel 3 (2411-2433MHz) seems to solve the problem. Part of the root cause of my troubles is that I’d pinned the WiFi channel in the garage to 1 a year or more ago; I forget why. I’ve now set it to auto. I’ve had stretches before where it seemed to work for a few hours at a time, so I don’t fully trust it yet. Edit: after a reboot the AP went back from channel 3 to 1, so I’ve configured it to hardcode channel 3 instead of “auto”. Whatever the interference is, the Ubiquiti hardware isn’t detecting it.
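Those frequency ranges come from the standard 2.4GHz channel plan: channel n (1 through 13) is centered at 2407 + 5n MHz, and the ranges quoted here assume the traditional 22MHz channel width. A small sketch of the arithmetic, with a function name I made up for illustration:

```python
def channel_range_mhz(channel, width_mhz=22):
    """Return (low, high) edge frequencies in MHz for a 2.4GHz WiFi channel.

    Assumes the classic 22MHz-wide channel; pass width_mhz=20 for the
    narrower OFDM figure.
    """
    if not 1 <= channel <= 13:
        raise ValueError("only channels 1-13 follow the 5MHz spacing")
    center = 2407 + 5 * channel  # channel 1 is centered at 2412MHz
    half = width_mhz // 2
    return center - half, center + half


for ch in (1, 3):
    low, high = channel_range_mhz(ch)
    print(f"channel {ch}: {low}-{high}MHz")
```

By this arithmetic channels 1 and 3 still share 2411-2423MHz, so moving two channels over only shifts part of the band.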

The most useful thing I did in the end was get 2 access points and 3 clients and try moving them around every few hours to see what triggered the problem. I used telegraf to ping them once a minute or so (but see below) for long term monitoring, and then occasional command line pings once a second for short term testing. Over two days I was able to narrow the problem down to the one channel in the one location, but it’s not related specifically to any particular hardware.
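For the short once-a-second tests, something like the following could collect loss and latency numbers from several clients. It is only a sketch: it assumes the Linux iputils ping and its usual summary format, and the function names and the address in the usage note are placeholders, not anything from my actual setup.

```python
import re
import subprocess

# Matches e.g. "10 packets transmitted, 6 received, 40% packet loss, time 9013ms"
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received.*?([\d.]+)% packet loss"
)
# Matches e.g. "rtt min/avg/max/mdev = 2.1/210.4/480.9/150.2 ms"
RTT_RE = re.compile(r"min/avg/max[^=]*= ([\d.]+)/([\d.]+)/([\d.]+)")


def parse_ping_summary(output):
    """Pull packet counts, loss and average latency out of ping's summary."""
    summary = SUMMARY_RE.search(output)
    if summary is None:
        raise ValueError("no ping summary line found")
    rtt = RTT_RE.search(output)  # absent when every packet was lost
    return {
        "sent": int(summary.group(1)),
        "received": int(summary.group(2)),
        "loss_pct": float(summary.group(3)),
        "avg_ms": float(rtt.group(2)) if rtt else None,
    }


def ping_once_a_second(host, count=60):
    """Run ping at a one second interval and return the parsed summary."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-i", "1", host],
        capture_output=True,
        text=True,
    )
    return parse_ping_summary(result.stdout)
```

Calling ping_once_a_second("192.168.1.50") with the client's own IP address would then return a dict with the loss percentage and average round trip time for that run.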

Some things I learned:

  • wavemon is a helpful tool on Linux for showing WiFi status. So is /proc/net/wireless, and so is learning all the arcane options to the iw command.
  • The UniFi dashboard has some decent stats of its own hidden away under “Air Stats” for each individual access point device.
  • WiFi has its own link layer retransmit; lots of those might account for higher latency. But I don’t know what timescale it operates on: 1ms? 500ms? You can see WiFi retransmits in wavemon as “failed” and in Air Stats as “retry”.
  • None of these diagnostics really showed me evidence the link was bad. All the basic link quality reports were 90% or better. No ridiculous number of retries (although it is 16% at the access point).
  • Linux machines, including RPis, need their WiFi country configured for WiFi to work at its best. This is particularly important for 5GHz.
  • Even with a perfectly good 5GHz channel available an RPi4 might still choose 2.4GHz. I could not find a way to reliably coerce it to use 5GHz.
  • Raspberry Pis have a variety of low power optimizations that mean they may not respond well to a ping every minute when otherwise idle. My RPi4 regularly takes 100ms to answer the first ping, then is a solid 2ms after. My original RPi Rev B just misses pings entirely if nothing else is happening on it. The RPi Zero 2 W seems pretty solid but then I’ve got it doing other real computation. Anyway my once-a-minute ping was showing phantom errors where there were none. A once-a-second ping test worked better.
  • There’s a power saving mode in the RPi radio hardware itself you can turn off with iw (iw dev wlan0 set power_save off, if the interface is wlan0).
  • Raspberry Pis may in general just not be very good at WiFi.
  • Don’t use mDNS names for ping testing. The mDNS query itself may get lost to a flaky / low power Pi and you never find the host.
  • TCP retransmits hide a lot of sins. But they only work right until they don’t. When my WiFi link was bad, about 5% of my requests wouldn’t get through at all.
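For logging the kernel's own view of the link alongside the pings, the wireless status file /proc/net/wireless is easy to parse. This is a minimal sketch assuming the usual layout (two header lines, then one line per interface with link, level and noise columns); the function names are mine, and what the numbers mean depends on the driver.

```python
def parse_proc_net_wireless(text):
    """Parse the contents of /proc/net/wireless into a dict keyed by interface."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) < 5:
            continue
        # Values are printed like "70." and "-40."; float() accepts those.
        stats[fields[0].rstrip(":")] = {
            "link": float(fields[2]),
            "level": float(fields[3]),
            "noise": float(fields[4]),
        }
    return stats


def read_wireless_stats(path="/proc/net/wireless"):
    """Read and parse the live file; only works on Linux with a WiFi interface."""
    with open(path) as f:
        return parse_proc_net_wireless(f.read())
```

As noted in the list, numbers like these looked healthy for me even while the link was dropping packets, so they are best treated as one input next to the ping results rather than a verdict.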