Machine Learning: Support Vector Machines

After weeks of complaining my machine learning course had too much detail on the math and guts of ML algorithms and not enough on applying ML to real data, I got my wish. This week’s course on Support Vector Machines (SVMs) was pretty high level and breezy, and the homework even more so. And now I’m dissatisfied; I feel like I didn’t get my week’s worth of learning! Not that I’m complaining about having an easy week, but I wish there were a bit more brain-bending around application to replace all the brain-bending I was doing before, figuring out the vector math and programming.

PS: I’ve created a Machine Learning category on the blog. And mucked with the theme. Boy, wordpress.com is super buggy and badly product managed.

Lectures

The main topic this week was Support Vector Machines, a supervised learning technique that was framed as being more powerful and practical in application than the previous techniques we used like linear regression, logistic regression, and neural networks.

Conceptually it works a lot like these other supervised learning systems. You define a cost function, but unlike the logistic cost function this one biases the system to prefer “large margins”. I.e., it’s not enough to say “49% chance it’s category A”; the training is encouraged to say “10% chance it’s category A” instead. You control this bias/overfit tradeoff with the parameters C and sigma. C is much like our lambda from before, a damping factor to encourage small constants. (Confusingly, C = 1/lambda.) Sigma shapes the cost function curve itself, at least in the usual Gaussian kernel.

Oh yes, kernels. This is a neat trick where you can modify your input features with various functions. A linear kernel (really “no kernel”) gives you a linear classifier that behaves much like logistic regression. The Gaussian kernel is a good general purpose kernel. There’s a bunch of other kernels people use. And some magic allows this pluggable kernel function to be applied efficiently, so the training system runs quickly.
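For my own reference, here is a minimal sketch of the Gaussian kernel in plain Python rather than Octave. It treats the kernel as a similarity score between two feature vectors, exp(-||x1 - x2||^2 / (2 sigma^2)); the function name and list-based vectors are my own choices, not anything from the course code.

```python
import math

def gaussian_kernel(x1, x2, sigma):
    """Similarity between two equal-length feature vectors, in the range (0, 1]."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same length")
    # Squared Euclidean distance between the two points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    # Identical points score 1.0; the score decays toward 0 with distance.
    # A larger sigma makes the decay slower (a smoother, higher-bias model).
    return math.exp(-squared_distance / (2 * sigma ** 2))

print(gaussian_kernel([1, 2, 1], [0, 4, -1], sigma=2.0))
```

A small sigma makes only very close points look similar, which is where the overfitting risk comes from.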

That’s where the lectures got a bit hazy. There was one whole 20 minute video dedicated to the math which was prefaced with “this is optional, and you should never try to implement this; use libsvm instead”. The main thing I learned from this is you can still understand speech when watching a video played back at 2.2x speed, even though the individual phonemes are no longer distinguishable. He skimmed over the math quite a bit, nothing was lost by ignoring it.

I never did quite understand why SVMs are better. We were advised they are best in applications with lots of training examples relative to the size of the feature set. Logistic regression is better if you have lots of features and few training examples. And neural networks may be better than SVMs too, but take longer to train. ¯\_(ツ)_/¯

Homework

The homework this week was super-easy, all about applying SVM to problems. Each of the 4 exercises was like 3 lines of code. The only hard part was learning Octave syntax for doing things like “find the element in the list”, which once again made me wish we were using Python.

Anyway we implemented two learning exercises. An artificial problem of “find the boundary between Xs and Os for 2d input points”. And a real problem of “build a spam classifier trained on this data set”. We were supplied with the SVM ML system itself, so all we had to do was write simple functions to compute the Gaussian kernel and boil a list of stemmed words down into a vector coding. It was kind of dumb.
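The “vector coding” step is simple enough to sketch in Python. This is my assumption of the general shape, not the course’s Octave code: one slot per vocabulary word, set to 1 if the word appears in the email. The vocabulary and the sample email here are made up.

```python
def words_to_features(stemmed_words, vocabulary):
    """Return a 0/1 vector with one slot per vocabulary word."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    features = [0] * len(vocabulary)
    for word in stemmed_words:
        i = index_of.get(word)
        if i is not None:      # words outside the vocabulary are ignored
            features[i] = 1    # presence only; repeat counts are discarded
    return features

vocabulary = ["buy", "click", "free", "meet", "now"]      # made-up vocabulary
email = ["click", "now", "for", "free", "free", "stuff"]  # made-up stemmed email
print(words_to_features(email, vocabulary))  # prints [0, 1, 1, 0, 1]
```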

The most useful assignment was writing a loop to try out a training system with various values of C and sigma, the tuning parameters for the SVM. And experimentally determine what values gave the most accurate trained model. I imagine this is the kind of thing you do in the real world frequently, and doing it well is an art.
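That loop is basically a grid search. Here is a rough Python sketch of the idea; `validation_error` is a hypothetical callable standing in for “train an SVM with this C and sigma, then measure error on held-out data”, and the candidate values are just an illustrative list.

```python
import itertools

def grid_search(candidates_c, candidates_sigma, validation_error):
    """Try every (C, sigma) pair and return (error, C, sigma) for the best one."""
    best = None
    for c, sigma in itertools.product(candidates_c, candidates_sigma):
        error = validation_error(c, sigma)
        if best is None or error < best[0]:
            best = (error, c, sigma)
    return best

# Stand-in for real training: a toy error surface with its minimum at C=1, sigma=0.1.
def toy_error(c, sigma):
    return (c - 1.0) ** 2 + (sigma - 0.1) ** 2

values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]  # illustrative, roughly log-spaced
error, best_c, best_sigma = grid_search(values, values, toy_error)
print(best_c, best_sigma)
```

The important part is that the error is measured on data the model was not trained on; otherwise the search just rewards overfitting.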

The spam filter problem was also fun because it felt real. Took inputs from the SpamAssassin spam corpus. Used their code to crunch that text down to stemmed words, which we then coded into feature vectors. Push it through SVM and you end up with a system that classifies 98.6% of spam correctly! Which is not so great, I think 99.99% accuracy is minimal for a useful spam system. And even then you really want to measure that false positive rate carefully. But I had a whisper of an idea of how to apply this stuff to a real problem that is really solved with ML systems like what we are studying, and that was fun.

Application

Once again I find myself wanting to do this in Python. I really want a class which is “applied Machine Learning in Python”. I guess focussing on how to use Pandas and scikit-learn. Maybe someone has written that up already?

autossh for a persistent ssh tunnel

My Internet in Grass Valley is newly fast, but my ISP’s NAT setup means I have no way of initiating connections from outside into my home network. Nice for security, awkward for hackiness. (They do offer a DMZ configuration but when I had them enable it on the previous link it never worked and broke traceroute, so I’m not asking this time.)

Instead I’ve set up a persistent ssh tunnel so that a port on my Internet server is always forwarded through to the ssh port on my Linux server in the house. That way I can ssh in remotely through the tunnel.

The only novel part of this is autossh, which is a wrapper for ssh that takes care of restarting the ssh tunnel if something goes wrong. autossh logs to syslog and has some configuration environment variables, none of which I needed to alter. It monitors for the tunnel dying and also periodically sends traffic through to make sure the tunnel is still working. ssh also has a keepalive mechanism it’s probably wise to use.

I mostly followed this guide for setting it up. Except for the Upstart config; after spending 30 minutes tearing my hair out I remembered I hate upstart and a good ol’ BSD 4.2 rc.local script would do just fine. Also I set stuff up to run as a separate user, and made an ssh config, so really it ended up being quite different. Steps:

  1. create a new “autossh” user on both machines with a shell of /bin/false
  2. set up ssh keys without passwords for this user, so that my house computer can log into my Internet server without any passwords
  3. Set up an ssh config for the tunnel for the autossh user at home.
    Host tunnel
      ServerAliveInterval 60
      HostName example.com
  4. Test that a basic manual ssh tunnel works as user autossh, without a password. This is important; first time you have to accept the ssh key manually!
    home$ autossh -N -R 2200:localhost:22 tunnel
  5. Put the command to launch the tunnel in /etc/rc.local on the home computer
    su autossh -s /bin/bash -c 'autossh -fN -R 2200:localhost:22 tunnel'

Note the “-s” flag to su in the init script. Without that it runs as the /bin/false shell the user has which not only doesn’t work, but doesn’t give any useful error output.

Ubiquiti: less LoL lag

One nice change with the new network: less lag playing League of Legends.

With my old hacky setup with triple NAT and routers acting as a bridge, I had a lag pattern like this:

[image: hacky lag]

Now with the Ubiquiti gear I have a lag pattern like this:

[image: lag after]

Note the lack of spikes! I have no idea what those were, something was getting buffered or something weird before.

I don’t think median latency or packet loss really changed, but I’m glad not to see those ping spikes. Although TBH they didn’t affect my gameplay much.

Ubiquiti success, sort of

Running a wire at my house isn’t working out, so I’m stuck needing a wireless link to get the 200′ outdoors to the base of a tree. Fortunately there’s power there! Today I replaced the hacky leftover router wireless link with a more permanent setup involving Ubiquiti hardware. So far so good, some things to improve.

The Ubiquiti equipment is interesting. In a lot of ways it’s just like a WiFi router with flexible firmware like DD-WRT on it. It has a wifi network interface and a wired network interface and various routing and bridging capabilities. But it’s more prosumer. Ubiquiti has pretty solid firmware (airOS) and I’m impressed with details like the web-based discovery tool and the way it reconfigures itself quickly without rebooting. Also the wireless implementation seems much more solid than consumer gear. And the hardware is designed for outdoor use with directional antennas, and supports distances of 5 miles out of the box. Pretty impressive.

One massive caveat: the wireless link startup time is slow. Like 30+ seconds, maybe 5 minutes the first time. It appears to be scanning the spectrum for the right channel to use. Not sure why that takes so long, or why you can’t just rely on normal 802.11 frequency hopping for the 5GHz links.

I set up a Nano M5 and a Nano Loco M5 paired together as a transparent ethernet bridge. That configuration means those devices are invisible to the rest of my network. My router sits at the other end and gets DHCP from my ISP’s equipment. The wireless network SSID for the nanos isn’t even visible (and anyway it’s not standard 802.11). I think this is the right configuration for my purposes.

The configuration is like this:

  • A WISP POP a mile away with IP 173.195.173.1
  •    … via a wireless link to …
  • ISP hardware in my tree, acts as a DHCP server and NAT router.
    The WAN side is at 173.195.173.xxx
    The LAN side is at 10.33.1.1 and provides DHCP and NAT to my house
  •   … via an ethernet cable to …
  • Nano M5 in Access Point mode. WDS is enabled for bridging. This device has an IP address of 192.168.0.111 but that’s mostly invisible.
  •   … via a 5GHz airMax link to …
  • Nano Loco M5 in Station mode. WDS is enabled for bridging. This device has an IP address of 192.168.0.110 but that’s mostly invisible
  •    … via an ethernet cable to …
  • My home router, an ASUS RT-N16 running as full AP + Router.
    The WAN side gets its IP address via DHCP from the ISP (via the Nanos). It happens to be 10.33.1.23 at the moment.
    The LAN side is 192.168.0.1 with subnet 255.255.255.0. Also provides DHCP and all the other home network services you’d expect
  •    … via an ethernet cable or 2.4GHz 802.11 link to …
  • Desktop computers, mobile devices, etc in 192.168.0.*

The good news is I’m not doing triple-NAT anymore. My router is doing all the real NAT work for my house. The ISP’s hardware is also NAT but I’ve taken pains to only have the one device connecting to it.

The main problem with this configuration is from my house network I can’t access the Nano status pages at 192.168.0.110/111. My router and the rest of my house’s devices think all of 192.168.0.* is on the LAN interface so won’t send packets out the WAN interface where they are. I could hack around this with static routes but that’s dumb. I should reconfigure the Nanos to a different subnet like 192.168.25.*. Then I add a single static route for that subnet in my router and I think I’m done. (I could set them to 10.33.1.* but that’s a bit funky, I definitely don’t want to risk those devices being visible to the wider Internet.)
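The routing problem is easy to see with Python’s standard `ipaddress` module. This is only a sketch of the subnet-membership check a router effectively makes; the function name is mine, and the addresses are the ones from this post.

```python
import ipaddress

# The router's LAN side: 192.168.0.1 with netmask 255.255.255.0.
LAN = ipaddress.ip_network("192.168.0.0/24")

def on_lan(address, lan=LAN):
    """True if the router would look for this address on its LAN interface."""
    return ipaddress.ip_address(address) in lan

for address in ["192.168.0.110", "192.168.0.111", "192.168.25.110"]:
    where = "LAN side (never reaches the Nanos)" if on_lan(address) else "not LAN, can be routed via WAN"
    print(address, "->", where)
```

Anything outside the LAN subnet can be pointed at the WAN interface with a static route, which is why renumbering the Nanos should work.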

The other small problem I have is the secondary PoE port on the Nano M5 isn’t working. In theory this should provide power to the ISP equipment in the tree. Power passthrough isn’t enabled by default but I fixed that, still not working. More tinkering required, for now I just use a second PoE injector.

This Ubiquiti firmware is pretty powerful and I feel like there’s some simpler configuration where I get rid of my house router entirely, use one of the two Nanos to be the router too. I’m not sure that really simplifies anything though, I’d probably still want the third box acting as an access point inside the house.

Update

I renumbered the Ubiquiti bridge devices to 192.168.1.110 / 192.168.1.111 and added a static route for them in my router. Works great. Next project; some sort of monitoring for them. They have SNMP and I find references to Munin plugins.

I also figured out the secondary PoE problem. I thought maybe it was total power; the Ubiquiti device wants 8W, the Cambium radio in the tree wants as much as 7W. So 15W total and the PoE injector only puts out 12W. But then I read more closely and discovered that the Cambium PoE implementation is reversed from the usual passive setup. Cambium puts positive voltage on pins 7 & 8, not 4 & 5 like everyone else. How stupid! Really dangerous, I’m lucky the magic smoke stayed inside the Cambium gear when I plugged it in. Maybe they engineered in some safety to overcome the perversity of using reverse polarity from what is common. (I hesitate to say standard, because there is no standard for passive PoE, although the 802.3af standard has a mode B that looks an awful lot like the passive PoE everyone but Cambium does.)

I’ve also discovered the radio in my house works in a closet, behind a relatively thick wall. Signal strength is degraded; airMAX quality is 78% compared to 95% with a clear view through the window, and max speed is about 200Mbps instead of 300Mbps. Still well above what I need for the 12Mbps Internet link though. Going to test it out, would be nice to not have to mount that thing outdoors or in the window.

Outdoor wiring

(Some very boring notes so someone else can wire this up)

Conceptually, we’re connecting a wire from the antenna in the tree to the Ubiquiti NSM5, one long straight wire. But it’s complicated because we need separate PoE injectors to provide power to both devices.

The black wire from the tree goes to the Phihong PoE module. It goes to the port on the left, labelled “Data + Power”. Very important to use the right module; the Cambium tree antenna uses non-standard PoE and this Phihong PoE module is special. The word “PHIHONG” is in the upper left of the label, beneath a spiral P logo. It’s the longer module, that does not say “Ubiquiti” in the upper right. The top one in this photo.

An ordinary short ethernet patch cable goes from the “Data” port on the right side of the Phihong PoE module to the “LAN” port on the right side of the Ubiquiti PoE module. This cable is currently a 3′ coil of white wire (would be nice to replace with a 1′ cable). This cable is not powered, not PoE, it just carries data between the devices.

Another ordinary ethernet cable goes from the “PoE” port on the left side of the Ubiquiti module to the Ubiquiti NSM5 antenna. It goes to the port on the right labelled “Main”. The port labelled “Secondary” is not used.

Finally, plug both PoE modules into AC power. There’s an LED on the back of the Ubiquiti radio to confirm it’s powered, but it’s pretty hard to see in bright sunlight. Once the devices have power they both take about 60 seconds to establish wireless links.

WiFi protocol notes

I’ve been doing Wikipedia reading associated with my wireless experiments on how 802.11n works. Here’s some brutal summaries, note I’m totally ignorant of signals processing and believe taming the electromagnetic field is basically witchcraft. I’m sure I made a mistake somewhere.

802.11n notes

  • 802.11b and 802.11g are basically a single wireless signal on a single 20MHz wide channel. 802.11g uses orthogonal frequency division multiplexing to split that 20MHz up into 52 separate subcarriers. There’s 8 different modulation and coding rates possible in 802.11g, resulting in data rates of 6, 9, 12, 18, 24, 36, 48, or 54 Mbps.
  • 802.11n is based on 802.11g.
  • 802.11n’s main innovation is multiple antennas and MIMO.
  • 802.11n can optionally use a 40MHz channel. That’s a bit naughty at 2.4GHz; it means you overlap with pretty much every other 2.4GHz user around you. My iMac doesn’t appear to support 40MHz at 2.4GHz.
  • 802.11n can also run at 5GHz. (My router doesn’t.) I don’t think it runs any faster at that frequency, although the channel separation is nicer.
  • Max 802.11n datarate at 20MHz channels is 288.8Mbps. At 40MHz it’s 600Mbps.
  • Multiple antennas also allow for MIMO, multiple-input and multiple-output. I don’t understand this at all but the Wikipedia page says this is about exploiting multipath effects to effectively get more useful signals out of the same bandwidth. This seems like complete magic to me.
  • A specific thing MIMO does is spatial division multiplexing. I think this is effectively a directional thing, ie: “I can talk to radios north of me and radios south of me independently”. Only I’m guessing this has more to do with orientation of the antenna, and may explain part of why simply rotating my antennas so they were not all parallel resulted in a better connection. (If the access point’s antennae are parallel then they are not spatially disjoint.) My iMac is currently enjoying two spatial channels to the router, which makes more sense if you think of rotational spatial division.
  • You get 1 spatial stream per antenna+radio (sort of), and 802.11n supports up to 4 antennas. My Asus RT-N16 is a 2×2 configuration so that’s effectively 2 transmit and 2 receive antennae. (There’s only 3 physical antennae, no idea how that works). That’s why it maxes at 300Mbps; only 2 spatial streams. With 20MHz bandwidth it maxes out at 144.4Mbps. My iMac gets 1 spatial stream sometimes and 2 most of the time.
  • 802.11n optionally supports a 400ns guard interval, half the old 800ns gap left between transmitted symbols. That seems to boost throughput about 10%. The Wi-Fi signal app on my iMac reports speeds that suggest I am using this, sometimes. (Ie right now: MCS index 7, but 73 Mbps).
  • 802.11n supports a bunch of different coding rates. The coding rate describes how much of the signal is reserved for redundant error correcting bits. My iMac is typically seeing a coding rate of 3/4 or 5/6.
  • 802.11n supports several different modulation schemes. That’s the digital to analog part of the radio, the way bits are turned into waves. My iMac is mostly using 16-QAM or 64-QAM, which corresponds to 4 or 6 bits per symbol. Neat visualization here
  • The coding rate, modulation, number of spatial streams, and maybe even the guard interval are varying frequently, at least once a second. Presumably the access point is picking which variant to use depending on the current error rate.
  • 802.11ac is the new hotness. It’s 5GHz, supports 80MHz and 160MHz channels, 8 spatial streams, 256-QAM (with non-standard 1024), etc. By my arithmetic the spec could in theory go up to about 6.9Gbps (compare 600Mbps for 802.11n). I think in reality you can actually get gigabit wireless speeds. But only if the spectrum is uncongested and the signal path is clear. My fancy ASUS RT-AC68U claims up to 1300Mbps at 5GHz. That seems to correspond to 3 streams on an 80MHz channel at MCS index 9, which is 256-QAM 5/6 rate.

MCS Index

Most of the 802.11n options are summarized in a single number called the MCS Index. See also this table. MCS Index specifies the number of spatial streams, the modulation type, and the coding rate. Those together define how much data can be stuffed in a single subchannel. Combine that with your bandwidth (20MHz or 40MHz) and your guard interval and you get a max data rate.

In practice my iMac mostly hangs out at MCS 7 or MCS 12. MCS 7 is a very dense modulation and coding, but only one spatial stream, and it’s reporting 73Mbps. MCS 12 is a less dense modulation but 2 spatial streams, and reports 78Mbps.
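To check those numbers, here is a small Python sketch of the arithmetic as I understand it. It assumes 802.11n uses 52 data subcarriers at 20MHz and 108 at 40MHz, and that one OFDM symbol takes 4.0µs with the 800ns guard interval or 3.6µs with the 400ns one. The function and table names are mine.

```python
# Assumed 802.11n constants: data subcarriers per channel width in MHz.
DATA_SUBCARRIERS = {20: 52, 40: 108}
BITS_PER_SYMBOL = {"BPSK": 1, "QPSK": 2, "16-QAM": 4, "64-QAM": 6}

def data_rate_mbps(streams, modulation, coding_rate, width_mhz=20, short_guard=False):
    """Nominal PHY data rate in Mbps for one MCS, channel width and guard interval."""
    symbol_time_us = 3.6 if short_guard else 4.0
    # Useful bits carried by one OFDM symbol on one spatial stream.
    bits = DATA_SUBCARRIERS[width_mhz] * BITS_PER_SYMBOL[modulation] * coding_rate
    # Bits per microsecond is the same thing as megabits per second.
    return streams * bits / symbol_time_us

print(round(data_rate_mbps(1, "64-QAM", 5 / 6), 1))                    # MCS 7, about 65
print(round(data_rate_mbps(1, "64-QAM", 5 / 6, short_guard=True), 1))  # MCS 7 short GI, about 72.2
print(round(data_rate_mbps(2, "16-QAM", 3 / 4), 1))                    # MCS 12, about 78
print(round(data_rate_mbps(4, "64-QAM", 5 / 6, 40, True), 1))          # the 600Mbps maximum
```

So the 73Mbps my iMac reports at MCS 7 looks like the short guard interval case, and 78Mbps at MCS 12 is the long guard interval case.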

Misc other observations

The upstream bandwidth from my link is 12Mbps, so I don’t really care about any of these higher speeds. I’m much more interested in reliability.

I had my RT-N16 configured to use 40MHz but my Mac wasn’t using that. (From a quick search, Apple doesn’t support 40MHz at 2.4GHz.) So I knocked it down to 20MHz in the router and now I think I’m getting a stronger signal. -60dBm instead of -65dBm. I suppose that’s possible? I didn’t measure carefully.

I’ve got two wireless devices connected to the router; my iMac and an old WRT54GL running 802.11g. I tried making both of those wireless links busy, about 5Mbps each, and didn’t see any obvious contention. My iMac didn’t even drop down to 1 spatial stream which I was sort of naively expecting. That’s all as it should be: 802.11g promises 54Mbps bandwidth and I wasn’t getting near that. Nice that it really works.

Google’s new OnHub router does 802.11ac and 802.11n in a 3×3 configuration.

My 2013 iMac’s wifi antenna is behind the Apple logo in the case. Why? Because it’s the only part of the back that’s not made of aluminum.

Machine learning: picking the right system

Week 6 of my machine learning course.

Lectures: applying machine learning, system design

This week’s lecture was about using machine learning algorithms well. Great series of notes, the kind of thing I’m really in this course for. Focus in particular was on figuring out how to improve the results of your learning system and then evaluate the result. You have a lot of options to tweak the model and learning: more properties, more data, different learning rates. Which should you use? He gave a few heuristics better than guessing.

The first thing is splitting data into training and test sets. Keep the evaluation test data separate so you have something to test with that the learning algorithm never saw. Good advice. He also points out that you often use tests on trained systems to pick your trained system; you can’t use that test data (“cross-validation data”) for final evaluation either, since your metalearning system was picked to do well on exactly that data.
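A minimal sketch of that three-way split in Python, using only the standard library. The 60/20/20 proportions and the function name are my assumptions; the important part is shuffling once and never letting the test slice leak into training or model selection.

```python
import random

def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle once, then cut into training, cross-validation and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]   # used to pick models and parameters
    test = shuffled[n_train + n_cv:]        # touched only for the final evaluation
    return train, cv, test

train, cv, test = split_dataset(range(100))
print(len(train), len(cv), len(test))  # prints 60 20 20
```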

The other thing was defining “high bias” models (underfit models, too few parameters) and “high variance” models (overfit models, too many parameters). These are important concepts! Then he gives a test to figure out if your learning system is one or the other, which is basically looking at error rates. Systems that train to low error but then test badly are high variance, overfit. Systems that train to high error and test badly are high bias, underfit. I get the intuition for this and look forward to applying it in the homework.
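That test can be written down as a tiny Python rule of thumb. This is my own simplification, not anything from the lectures: `acceptable_error` is a hypothetical threshold you would have to pick for your problem.

```python
def diagnose(train_error, cv_error, acceptable_error):
    """Rough bias/variance label from training and cross-validation error."""
    if train_error > acceptable_error:
        # The model can't even fit the data it was trained on.
        return "high bias (underfit)"
    if cv_error > acceptable_error:
        # Fits the training data but doesn't generalize.
        return "high variance (overfit)"
    return "looks fine"

print(diagnose(train_error=0.30, cv_error=0.32, acceptable_error=0.10))
print(diagnose(train_error=0.01, cv_error=0.25, acceptable_error=0.10))
```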

Finally we moved on to ways to build systems and evaluate them for success. Really nice exhortation to build a quick and dirty system first with the simplest features and data you can get your hands on. Start working with the problem and you’ll understand it better, then you can make an informed decision about what extra data you need. Also an interesting observation about the usefulness of very large datasets. Having an enormous amount of training data means you can use more complex models with less bias, and be confident that you won’t overfit, just from the sheer size of that training data set.

Ng defined precision and recall for measuring an algorithm’s correctness, also boiling those two numbers down into a single F-score for ease of comparison. Useful in particular for skewed datasets where you may be classifying something that’s only True 1 in 1000 times.
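Here is a short Python sketch of those three numbers, with the F-score as the harmonic mean of precision and recall. The function name and the decision to return 0.0 when a ratio is undefined are my own choices. The example shows why accuracy alone is misleading on skewed data.

```python
def precision_recall_f1(actual, predicted):
    """Compute (precision, recall, F1) for paired lists of booleans."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a and p)
    false_pos = sum(1 for a, p in pairs if not a and p)
    false_neg = sum(1 for a, p in pairs if a and not p)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Skewed data: 1 positive in 1000. Always predicting False is 99.9% accurate...
actual = [True] + [False] * 999
always_false = [False] * 1000
print(precision_recall_f1(actual, always_false))  # ...but scores (0.0, 0.0, 0.0)
```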

Homework

This week’s homework has us loop back and implement the very first model we learned, linear regression, then test and evaluate it according to some of what we learned. I didn’t care much for the linear regression part, but the hands-on application of learning systems and error measurements was great.

The first half of the homework was “implement linear regression learning with all the bells and whistles”. We’d not done that before, only logistic regression, so it was similar and yet different. Not too challenging really, a bit repetitive from previous work, but reinforcement is good.

The second half of the homework was “train a bunch of learning models and measure their error”. Specifically comparing the error on the training set vs. the error on our reserved cross-validation set, and graphing learning curves, etc. This is a sort of meta-machine learning, choosing the parameters for our machine learning system so that we get a well trained model.

We looked first at a truly linear model, trying to fit a straight line to data that’s roughly quadratic. Lots of errors. The learning curve shows us that adding more data points isn’t really going to help make this model more accurate, it is over-biased. So then we switched to an eighth order polynomial model but trained it without regularization. That fits the training data great but fails cross-validation, because it is high variance.

So finally we applied regularization in the learning model, the mechanism that encourages the learning to prefer low-magnitude constants for the polynomials. The encouragement strength is governed by the value lambda. So train lots of models with low lambda and high lambda and see which lambda is best. Lambda = 3, that’s best! At least for this problem.
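For the record, this is roughly what the regularized cost looks like in plain Python. It’s a sketch of the usual form, squared error plus lambda times the sum of squared parameters, all over 2m, with the intercept left out of the penalty. The names and the list-of-rows layout are my assumptions, not the homework’s Octave code.

```python
def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost; each row of X starts with a 1 for the intercept."""
    m = len(y)
    squared_error = 0.0
    for row, target in zip(X, y):
        prediction = sum(t * x for t, x in zip(theta, row))
        squared_error += (prediction - target) ** 2
    # Penalize every parameter except theta[0], the intercept.
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (squared_error + penalty) / (2 * m)

X = [[1, 1], [1, 2]]   # two examples, one feature plus the intercept column
y = [2, 3]
print(regularized_cost([1, 1], X, y, lam=0))  # perfect fit, no penalty: 0.0
print(regularized_cost([1, 1], X, y, lam=3))  # same fit, penalty only: 0.75
```

When comparing different lambdas, the cross-validation error should be computed with lam=0 so you are comparing fit, not penalty.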

Conclusion

I’m really itching to apply this stuff to my own problems now, to make it real. But I’ve got to solve two problems at once to do that. I need confidence in my own code without the safety net of homework problems with calculated correct answers. And I really want to redo this work in Python, to get out of the weird Octave world. Taking both leaps at once is foolish. I should either bring a problem of my own into Octave once to apply my proven code to new data. Or else port the homework to Python first and verify I still get the right answers to prove my new code. I tried the latter already and got frustrated, should try the former.

Project idea: open stuff with VLC

VLC doesn’t install itself as the default handler for media files like .m3u or .mkv or whatever. Not sure why, but the documented process is tedious.

I should write a simple script / program to set VLC to be the default app for a bunch of types that VLC supports. Maybe absolutely everything that would otherwise open in iTunes is a good start :-P The program duti is a good start on how to set associations in MacOS. It seems to work by using the Launch Services API.
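A first sketch of that script in Python. Big caveats: I’m assuming duti accepts `duti -s <bundle id> <extension> <role>`, that VLC’s bundle id is `org.videolan.vlc`, and the extension list is just an example. It defaults to a dry run that only prints the commands.

```python
import shutil
import subprocess

VLC_BUNDLE_ID = "org.videolan.vlc"                  # assumed bundle id for VLC
EXTENSIONS = [".m3u", ".mkv", ".mp4", ".avi"]       # example list, not exhaustive

def duti_commands(extensions, bundle_id=VLC_BUNDLE_ID, role="all"):
    """Build one duti command (as an argument list) per file extension."""
    return [["duti", "-s", bundle_id, ext, role] for ext in extensions]

def apply(commands, dry_run=True):
    """Print the commands, or run them if dry_run is off and duti is installed."""
    for command in commands:
        if dry_run or shutil.which("duti") is None:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)

apply(duti_commands(EXTENSIONS))
```

Run it once as a dry run to eyeball the commands, then call `apply(..., dry_run=False)` to actually change the associations.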