SVMs in Python

I started trying to apply SVM machine learning to my Battlefield 4 dataset and got a model that predicted my data with only 60% accuracy. I was expecting 95% or more, though I don’t really know what is achievable, and I realized I was trying to learn too many new things at once.

So I went back to my Coursera homework, exercise 6, and used their data and problem sets as Python exercises. Specifically I trained SVM classifiers on two datasets. The result is in this notebook.

Cells 2–4 simply fit a binary classifier to the toy input data. Cells 5–9 fit a binary classifier to a different dataset, this time training repeatedly to pick good values for the parameters C and sigma. I did it the homework’s way in cells 6–7, using an explicitly split training and test dataset. Then in cells 8–9 I used sklearn’s cross_validation library, which resamples the data to produce test sets on the fly. Results were roughly the same. I’m not 100% certain it is philosophically kosher to use CV in this manner, but I think it is.
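For anyone who wants to see the two parameter searches side by side, here is a minimal sketch. It is not the notebook's code: the data is a synthetic stand-in from make_moons, the candidate list is a hypothetical one, and it imports from sklearn.model_selection, which is where newer sklearn releases keep what used to live in sklearn.cross_validation.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: the homework data is not reproduced here).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical candidate values, tried for both C and sigma.
candidates = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]


def search(score_fn):
    """Return (best_score, C, sigma) over the full candidate grid."""
    best = (-1.0, None, None)
    for C in candidates:
        for sigma in candidates:
            # sklearn wants gamma, not sigma: gamma = 0.5 / sigma**2
            clf = SVC(kernel="rbf", C=C, gamma=0.5 / sigma**2)
            score = score_fn(clf)
            if score > best[0]:
                best = (score, C, sigma)
    return best


def holdout_score(clf):
    # Explicit split: fit on the training part, score on the held-out part.
    return clf.fit(X_train, y_train).score(X_val, y_val)


def cv_score(clf):
    # 5-fold cross-validation: held-out folds are carved from the data on the fly.
    return cross_val_score(clf, X, y, cv=5).mean()


holdout_best = search(holdout_score)
cv_best = search(cv_score)
print("explicit split  :", holdout_best)
print("cross-validation:", cv_best)
```

The two searches may not agree on the exact C and sigma, since several grid points tend to score about the same; what matters is that the best scores land in the same neighborhood.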

Things I learned:

  • You can load MATLAB arrays with scipy.io.loadmat()
  • The ‘rbf’ kernel is what sklearn.svm calls the Gaussian kernel.
  • Instead of sigma, sklearn parameterizes rbf with gamma. gamma = 0.5/sigma**2
  • Decision boundaries are plotted by simply creating a raster grid of predictions, then making a contour plot to only show the boundary.
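The points in the list above fit together in a few lines. This is only a sketch under assumptions: it first writes a small made-up .mat file so it can run on its own, the variable names "X" and "y" are hypothetical, and the sigma and C values are arbitrary picks for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw off-screen so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import loadmat, savemat
from sklearn.svm import SVC

# Made-up stand-in for a homework .mat file (assumption): points inside
# the unit circle are class 1, everything else is class 0.
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 2)
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 < 1.0).astype(int).reshape(-1, 1)
out_dir = tempfile.mkdtemp()
mat_path = os.path.join(out_dir, "demo.mat")
savemat(mat_path, {"X": X_demo, "y": y_demo})

# loadmat returns a dict mapping MATLAB variable names to NumPy arrays.
data = loadmat(mat_path)
X = data["X"]
y = data["y"].ravel()  # MATLAB column vector (n, 1) -> flat array (n,)

# 'rbf' is the Gaussian kernel; convert sigma to sklearn's gamma.
sigma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=0.5 / sigma**2).fit(X, y)

# Raster grid of predictions covering the data range.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
    np.linspace(X[:, 1].min(), X[:, 1].max(), 200),
)
grid_pred = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# A single contour level at 0.5 sits between class 0 and class 1,
# so only the decision boundary is drawn.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
ax.contour(xx, yy, grid_pred, levels=[0.5])
fig.savefig(os.path.join(out_dir, "boundary.png"))
```

A finer grid gives a smoother boundary line at the cost of more calls to predict; 200 by 200 is plenty for a 2D toy set.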

The big thing I learned is how sensitive these systems are to the choice of sigma and C. Before I had the code working right I was using bad values and getting insane results. That is relatively easy to spot in a 2D dataset, where you can plot the data and the decision boundary, but it must be much harder to understand and debug in large multidimensional datasets.
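When there is nothing to plot, one rough way to notice a bad sigma is to compare training accuracy against held-out accuracy. The sketch below does that on synthetic make_moons data (an assumption, not my datasets), with three arbitrarily chosen sigma values and C fixed at 1.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for sigma in [0.01, 0.3, 30]:  # far too narrow, plausible, very wide
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5 / sigma**2).fit(X_train, y_train)
    results[sigma] = (clf.score(X_train, y_train), clf.score(X_val, y_val))
    print("sigma=%-5s train=%.2f  val=%.2f" % ((sigma,) + results[sigma]))
```

A very small sigma lets the model memorize the training points, so training accuracy looks great while held-out accuracy falls apart. A large gap between the two numbers is the warning sign, and it works in any number of dimensions.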