Machine Learning: Support Vector Machines

After weeks of complaining my machine learning course had too much detail on the math and guts of ML algorithms and not enough on applying ML to real data, I got my wish. This week’s course on Support Vector Machines (SVMs) was pretty high level and breezy, and the homework even more so. And now I’m dissatisfied, feel like I didn’t get my week’s learning’s worth! Not that I’m complaining to have an easy week, but I wish there were a bit more brain-bending around application to replace all brain bending I was doing before figuring out the vector math and programming.

PS: I’ve created a Machine Learning category on the blog. And mucked with the theme. Boy, is super buggy and badly product managed.


The main topic this week was Support Vector Machines, a supervised learning technique that was framed as being more powerful and practical in application than the previous techniques we used like linear regression, logistic regression, and neural networks.

Conceptually it works a lot like these other supervised learning systems. You define a cost function, but unlike the logistic cost function this one biases the system to prefer “large margins”. Ie basically it’s not enough to say “49% chance it’s category A”, the training is encouraged to say “10% chance it’s category A” instead. You control this bias/overfit tradeoff with the parameters C and sigma. C is much like our lambda from before, a damping factor to encourage small constants. (Confusingly, C = 1/lambda). sigma shapes the cost function curve itself, at least in the usual Gaussian kernel.

Oh yes, kernels. This is a neat trick where you can modify your input features with various functions. A linear kernel gives you linear regression (exactly? or more or less?). The Gaussian kernal is a good general purpose kernel. There’s a bunch of other kernals people use. And some magic allows this pluggable kernel function to be applied efficiently, so the training system runs quickly.

That’s where the lectures got a bit hazy. There was one whole 20 minute video dedicated to the math which was prefaced with “this is optional, and you should never try to implement this; use libsvm instead”. The main thing I learned from this is you can still understand speech when watching a video played back at 2.2x speed, even though the individual phonemes are no longer distinguishable. He skimmed over the math quite a bit, nothing was lost by ignoring it.

I never did quite understand why SVMs are better. We were advised they are best in applications with lots of training examples relative to the size of the feature set. Logistic regression is better if you have lots of features and few training examples. And neural networks may be better than SVMs too, but take longer to train. ¯\_(ツ)_/¯


The homework this week was super-easy, all about applying SVM to problems. Each of the 4 exercises was like 3 lines of code. The only hard part was learning Octave syntax for doing things like “find the element in the list”, which once again made me wish we were using Python.

Anyway we implemented two learning exercises. An artificial problem of “find the boundary between Xs and Os for 2d input points”. And a real problem of “build a spam classifier trained on this data set”. We were supplied with the SVM ML system itself, so all we had to do was write simple functions to compute the Gaussian kernel and boil a list of stemmed words down into a vector coding. It was kind of dumb.

The most useful assignment was writing a loop to try out a training system with various values of C and sigma, the tuning parameters for the SVM. And experimentally determine what values gave the most accurate trained model. I imagine this is the kind of thing you do in the real world frequently, and doing it well is an art.

The spam filter problem was also fun because it felt real. Took inputs from the SpamAssassin spam corpus. Used their code to crunch that text down to stemmed words, which we then coded into feature vectors. Push it through SVM and you end up with a system that classifies 98.6% of spam correctly! Which is not so great, I think 99.99% accuracy is minimal for a useful spam system. And even then you really want to measure that false positive rate carefully. But I had a whisper of an idea of how to apply this stuff to a real problem that is really solved with ML systems like what we are studying, and that was fun.


Once again I find myself wanting to do this in Python. I really want a class which is “applied Machine Learning in Python”. I guess focussing on how to use Pandas and scikit-learn. Maybe someone has written that up already?