Machine Learning study

My old colleague Karl Rosaen is taking time off from working to study Machine Learning. He’s meticulously documenting his curriculum and day to day work, good resources there.

One big part of Karl’s studying is Raschka’s book Python Machine Learning. It looks like a nicely practical book, I was looking for something like that last year and it wasn’t quite out yet. His example code is online. And his blog post about writing it is nice.

My biggest regret from my machine learning work is not applying it to real problems. Karl mentions that the Kaggle competitions are a good set of problems and datasets to work on.

Machine Learning: reflections on the Coursera course

I’m all done with my Machine Learning course on Coursera, and I wrote a lot about it as I was taking it. Time to look back and put it together.

First off, it was a good use of my time. I complained about aspects of the course and I definitely think it can be improved. But it was a good introduction to what machine learning is and enough hands-on that I feel like I can no go do my own work with real tools. I also mostly liked the Coursera structure, having weekly deadlines and assignments was the structure I needed to actually learn something.

Techniques and Approaches

The thing I was most interested in learning is the general gestalt of machine learning, like how to sit down and get a problem solved. That was covered reasonably well through the course to varying degrees. Some themes:

  • The general structure of “take a training set of data. train a model on it. evaluate the mode against your test set. apply the model”. In some sense all the machine learning systems do the same thing: map a vector of input data to a scalar output value. The various algorithms are just different approaches to learning how to map input data to outputs. I didn’t understand they all fit into this category before.
  • The specific way that the training step is basically just running a function minimizer over your cost function (ie: measure of error). The subtlety here is most minimizers require some understanding of the gradient as well, which means you need to take partial derivatives of your cost function. (I wonder if numeric approximations to the derivative are useful in practice?)
  • The concept of high bias vs high variance models, ie: overfitting the data vs. underfitting the data. Basically if your choice of algorithm has a lot of internal knobs to twiddle, it’s likely you have high variance and might overfit the data, particularly if you don’t have a big data sample. Alternately if your chosen model is too simple the system might not ever predict the data well because it lacks the expressive power. I particularly liked the way you can detect which problem your system has with measuring the learning curve, how well your system learns based on how much data you give it.
  • Precision, recall, and f-score as a way to evaluate the success of a model.
  • A principled way to normalize data so that all your numbers are on comparable scales of roughly -1 to 1, with mean 0.
  • Regularized learning algorithms, with an extra term in the cost function to discourage overfitting.
  • Machine learning pipelines. The way you can segment a broad problem like photo OCR or automatic driving into smaller machine learning problems. Also ceiling analysis to figure out which part of your pipeline could be improved the most.
  • Stochastic machine learning. For very large datasets just iteratively learn on subsets of the data. Naturally leads to online learning, systems that continually learn in response to new data.


Most of the class was a tour through machine learning algorithms. Too much time implementing those algorithms for my tastes, but at least there’s no mystery to many of them now. Things we learned:

Supervised learning (ie: your training data is classified with expected outputs you are trying to predict.)

  • Linear regression: fit a linear model to the data. Predict a single number from a vector of numerical inputs.
  • Logistic regression: fit a logistic model to the data. Predict a single binary classifier from a vector of numerical inputs. (Really it outputs a probability!) Can predict N classes with one-vs-all classes.
  • Neural networks. Fit multi-stage regression to the data. Hidden layers can discover and calculate their own learned features. Good for non-linear models.
  • Support Vector Machines: logistic regression with a different error function that encourages your system to make sharp distinctions when classifying data. Also pluggable kernels, which allow you to change the function applied to features being considered to get beyond linear models. I got the impression SVMs with Gaussian kernels are the right choice in practice for many problems.

Unsupervised learning (no expected output on hand)

  • K-means clustering. Group your data into K natural clusters.
  • Principle component analysis. Boil high dimensional data down to fewer dimensions, while measuring and minimizing the loss of meaningful data.
  • Anomaly detection. Find data examples way outside the mean.
  • Collaborative filtering. Learn from user preferences what features of something result in those user preferences.

Practical skills

Practical application of subject matter was the weakest part of the course. The primary new technical skill I learned was Octave. But that feels like an increasingly obsolete tech and I wish I were using a language I’d be using more later. OTOH it was fun programming in a matrix math language with vectorization. I think if I were reimplementing this course today with the same curriculum, I’d consider using R and a notebook environment.

There were some real applied problems like number OCR and spam classification and movie recommendations. Those were interesting to me, but the exercises tended to have us solving one piece of a system to learn the data rather than putting the whole thing together.

I keep thinking it’d be fun to design an alternate machine learning course, one that focusses on the more practical application. Start with you downloading and pre-processing data. Then create a machine learning experimentation pipeline, plugging in out-of-the-box algorithms rather than implementing the algorithms yourself. Then go back and iteratively improve your application of ML, evaluating with learning curves and test data sets and honing the system so it really works.That’s a different course, but if I were doing that I’d do it in Python with scikit-learn and IPython notebooks.

What I missed

As noted above, the main thing I missed is more practical application. But that’s OK, because I think I learned enough to teach myself the practical stuff.

I also wish the course had a broad overview of more algorithms. There are literally thousands of machine learning algorithms in use out there, and while I trust we hit the most important basic concepts I’d have loved a single week which was just a whirlwind tour. Bayesian inference, decision trees, Markov models, … so many options.

Next steps

Now on to applying what I’ve learned to my own data. I already did one little exercise in PCA and clustering with Battlefield 4 data, which was a good experience. I should really try applying SVMs to something next. Maybe League of Legends match data, but it’s hard to get a large sample with the default API rate limiting.

What I really want to do is try applying this stuff to map data. Some machine learning system to improve OpenStreetMap, maybe by identifying anomalies comparing the vector map users generate to raster aerial imagery.

Machine Learning: Photo OCR, real world problem

These last two weeks of the course were so short I finished them all in one day.

I liked this last week, week 11. Most of the lectures were about a real world machine learning problem: finding and OCRing text in photographs.

Photo OCR

The key idea here is the pipeline. He broke the problem up into 4 separate machine learning problems that can be worked on independently.

  1. Find regions of the image that look like text.
  2. Segment text regions into individual characters.
  3. OCR individual characters
  4. Spell correct OCR errors

The real meat of the lesson was the sliding windows algorithm, the technique by which you identify text regions. Basically you look at small square regions of the image and train a recognizer to identify text/not-text. Then you merge adjacent regions of text into single blobs for handing off to step 2, the segmenter. The segmenter also uses a sliding windows analysis to find whitespace between characters, albeit a 1d window.

Lessons for machine learning

Second half of the lesson was reflections on doing better for machine learning.

First question: do you need more data? It may be expensive to get it. More data is most useful for low bias machine learning algorithms, so don’t bother if you have an overfit model. (Or rather, loosen your model first!). He also talked about artificial data synthesis, ways to generate more data from existing labelled training sets. For instance if you’re doing voice recognition, maybe you can generate more training examples by adding real-world distortion to existing examples. However don’t just add random noise, since that doesn’t really train anything useful.

Second question: what should I work on next? The key idea here is ceiling analysis. Basically you go down your pipeline replacing each step with a perfect system. Ie, replace step 1 with a perfect text region classifier; how much does your system improve? Now also replace step 2 with a perfect segementer: how much better does your system do? With that experiment in hand you can identify which steps in the pipeline have the most room for improvement, are worth your time to work on.

Once again I wish there were a programming exercise here to hammer these lessons home. Have us construct our own pipeline and do some testing / segmentation on it. I think this would be a good time to revisit the ideas of training vs. test sets, applying the test set well. Also evaluating systems for bias/variance problems. I suspect the Stanford undergrad class had students doing some final project which was just left out of the Coursera class, perhaps because it would be impossible to grade. Guess it’s up to me to apply this stuff to real problems and make my own mistakes without a teacher to tell me how I’m doing. Unsupervised learning, as it were.

And that’s the course! I’ve got one more blog post coming summarizing what all we covered and what I thought. Overall I’m quite positive, it was a good use of my time.

Machine Learning: large scale learning

Major shift of tone in week 10 of my course. Less detailed algorithms and programming assigments, more of a “here’s some wisdom about real world use”. Well a bit of each, but no programming. Being this high level makes me feel a bit dislocated, I appreciate more now the harder assignments in previous weeks.

Stochastic Gradient Descent

Most of the lecture was on the problem of running machine learning algorithms on enormous data sets, say 100,000,000 examples. These algorithms tend to be of the form “calculate this cost function over all data. then average the cost and update theta, our model parameters”. That’s inefficient when your data is 100M examples! So instead we update theta based on only some of the example set.

Stochastic gradient descent has you doing gradient descent where you update theta once per data example. That means that instead of swiftly following the gradient down to the optimum your optimizer is now doing a bit of a drunken walk, but in the right direction towards the optimum. And with a huge amount of data you can get away with looking at all the data a little less often. Ie: a normal batch gradient descent may do 100 evaluations of all 100M data points. With stochastic you can get away with 1-10 evaluations of the 100M data points. I assume law of large numbers is why. One caveat: it’s important to iterate through your data in a random order, shuffle the data first.

“Mini batch” gradient descent is a middle ground where you calculate the cost over, say, 1000 samples before updating theta. Stochastic looks at 1 sample at a time, batch looks at all N, mini-batch looks at a subset of say 1000. The reason you do this middle ground is because you can still get some benefits of vectorization and/or parallelization.

Convergence is a bit woolly with these stochastic approaches. You still want to monitor convergence, but you really don’t want to calculate the cost function over the whole 100M data samples every time! So instead look at a sliding window of the last, say, 100 cost function calculations you did and monitor it to make sure it’s converging.

Online Learning

Closely related to stochastic machine learning is online learning, where your machine learning model is updating itself continuously in response to new user input. That’s how real websites work. The neat thing about this is your model will adapt to changes in the world over time. Ie: you may have trained a perfect spam classifier in 2013, but spam techniques changed. An online learning system will adapt to new techniques as new data comes in.

Parallelism and map reduce

Last lecture was a funny coda which was basically “oh yeah, you want to run machine learning in parallel on big data!” along with “there’s this thing called map reduce you should know about”. It’s obvious how many ML algorithms are embarrassingly parallel over the input data, so maybe it’s fine to go quickly over this. But a simple map reduce implementation of, say, linear regression would have been a great programming exercise. Perhaps not in Octave though, maybe that’s why they skipped it. I wonder if scikit-learn has support for map reduce? Quick search suggests not really, you roll your own on top of Hadoop.

My first iPython notebook

I finally sat down and learned some IPython for data exploration today. I re-examined some data I had for Battlefield 4 players, something I’d published as my BF4Plots visualization a year or two ago. It’s the stats on ~5000 top Battlefield 4 players, things like their win/loss ratio, skill score, etc.

You can see my notebook in this gist. That’s a static snapshot taken from my last run. If you run it yourself the results will change slightly as some of the algorithms I apply include randomization.

My main goal here was to get more comfortable using IPython. It’s awesome! I really love having the inline, easy to produce graphs. I mostly followed this tutorial for guidance.

Secondarily I was also trying to apply some cluster analysis to my BF4 data to see if I could learn any insights. Not really. This is my first foray into my own machine learning project applying stuff I’ve learned in my course. And it was interesting, I definitely felt I understood what the scikit algorithms were doing. Also a little lost how to get real meaning out of the data. No surprise, that’s the hard part!

The main IPython thing I’m still adjusting to is that it’s a notebook, not a program. I’m so used to iteratively working and re-running my program from scratch. But IPython encourages you to keep a persistent Python VM around and iteratively add to its state. I keep looking for the “Start over and run it all from scratch” button, and there sort of is one, but that’s not really how you’re supposed to work. Which got me into trouble a couple of times. Also I do wonder how IPython people transition their code to running programs for other people to reuse.

Update: a “restart and run all” button was just merged into the main codebase so that’s coming soon. And apparently I’m right that notebooks are more for exploration. When it comes time to create reusable code you create a normal module with some other code editing environment. You can certainly start by pasting your notebook-developed code over, though.

First Python library I learned was Pandas, the data management library. Really all I did was work a bit with DataFrame, which is a data container type that’s a 2d array on steroids. Columns are typed, everything’s nicely addressable and searchable, and it’s all efficient numeric spare arrays behind the scenes. I think it even supports multicore number crunching in a transparent fashion. Really nice library, a joy to use.

Second library I learned was matplotlib, some basic bashing at it to draw some graphs. It’s really great and I wish I’d invested the time to learn this before. But until I saw the HTML notebooks it wasn’t very compelling to me. It’s funny, Mathematica notebooks have been a thing for 20+ years but it’s only the last couple of years where Python caught up.

Final thing I learned was a bit of scikit-learn, in particular how to apply its PCA and K-Means algorithms. Went pretty smoothly. My only annoyance is it seems to only half-support Pandas. It will take DataFrames as inputs but then the objects it returns are basic numpy Arrays which lack column names and some of the nice slicing options. I can’t tell if I’m doing it wrong, it has to be this way, or if it’s just that Pandas is so new that scikit-learn hasn’t fully adopted it yet.

Anyway, all in all my experiment was a big success, very happy. Fun stuff!

Machine Learning: anomaly detection, recommender systems

Week 9 of 11, and the last week with coding homework!

Anomaly Detection

The first unit of lectures was on anomaly detection: picking outliers out of a dataset. Very simple compared to previous algorithms we’ve studied. In the one feature case it boils down to “calculate the mean and standard deviation of the input set. Values way outside the mean are anomalies”. He does define it carefully in terms of Gaussian probability distributions, which is the solid way to do this at least.

So what about multi-featured data? Instead of doing proper multivariate Gaussians, the course focusses on handling vector input data by analyzing a single-variable Gaussian separately for each input feature, then calculating the product of all the probabilities to label the full vector. The result is a special case of the general multivariate Gaussian, in particular it’s a case that only really makes sense if you believe there’s no correlations between your input features. If you do it this limited way you’re stuck having to inspect the data to define new synthetic features, like “height times width” based on domain knowledge that the area of something is more relevant than the height or the width itself.

There are optional lectures on multivariate Gaussians but I’m left wondering why the course just doesn’t start there. Maybe to simplify the math (ha!) or maybe because the special case is more efficient to calculate. He notes the multivariate case doesn’t work well if you have more features that data samples. But if that’s the case, you probably shouldn’t be trying some fancy statistics on your meagre dataset anyway. (The irony is the homework code actually does use the multivariate Gaussian afterall.)

The artful step of applying anomaly detection is determining the threshold for what an anomaly is. Ie: if the datapoint has probability of 0.9 it’s clearly good. 0.1? Probably. But 0.01? Or 0.001? Or 0.0001? How improbable does it have to be before you decide it’s an anomaly? The course has us iteratively learning a good epsilon threshold for discrimination, using a classified test set. But we have the standard deviation of the good data available to us; surely it’d be better to talk about “5 standard deviations out” being a threshold? Or maybe that’s implicit in the Gaussian probability calculation?

The programming homework was awfully simple. The solution for the first part of the first question is literally “mu = mean(X);”. I have come to appreciate how nice Octave is about writing vectorized code for computing with vector and matrix data. So much nicer than writing loops!

Recommender systems

Second unit was on recommender systems, specifically collaborative filtering. This is serious throwback for me; when I started grad school in 1996 my group was all excited because some previous students had just gone off to found Firefly, a company to do collaborative filtering recommendations. They did sort of OK and finally sold to Microsoft 6 years later for a rumored $40M. IIRC the machine learning part wasn’t the real value of the company in the end, but rather some engineering details of how they built it.

Anyway, collaborative filtering is a neat algorithm because it lets a machine learning system really learn something. You give it a big batch of user ratings of, say, movies. The system divines hidden features for those movies that best predict why people have preferences for certain movies. (Ie: maybe it figures out there’s a grouping which is roughly “action movies”). Then you can use these learned features to go back and predict ratings you haven’t seen before.

I particularly like the way Ng presented the algorithm first as an iterative problem. “Use ratings to predict features. Then use those features to predict ratings. Then use those to improve your feature predictions, …”. But in the end it all gets collapsed into a single optimization problem.

OTOH I still hate the math and notation presented in the lectures, it’s just impenetrable. I understand the gist of what it’s doing but what is this supposed to convey? My desire now is that the course not be taught with math notation but instead Octave code notation, so that all those ugly sums and stuff are replaced with simple vector math code samples.


One thing the lecture didn’t take is how you decide how many features to let the system learn. I believe this choice is arbitrary but probably essential to good performance. Someone else already asked in the course discussion. The mentor answered saying basically “try the smallest number that works well on test data”.

I was dreading the homework, since it’s yet again a “translate the confusing math to code” exercise. But having done this kind of thing a few times and with the mentor’s course notes as a guide, it’s pretty straightforward. Particularly to just do a vectorized implementation instead of the for loops the lecture definitions and exercises suggest. Seriously, I think this course got its approach wrong, it’s insane to try to do the calculations unvectorized in a language like Matlab. Both for clarity and performance. The lectures should teach to this approach.

The exercises build up to an actual movie recommender system (using MovieLens data). Neat! Only the homework is flubbed. With the originally supplied code we tell the recommender we liked Alien and Die Hard 2 and then it recommends Star Wars and Braveheart. Makes sense, right? Only the predicted rating I will give these films is 8.9 on a scale of 1–5. Oops. But at least the list of recommended movies is plausible.

Turns out the supplied code forgot to use some normalized values. So I fix that code as instructed and run it again, and now the highest predicted rating is 5. Good! Only now the list of recommended films are completely obscure movies like “Santa with Muscles” and “Aiqing wansui”. I’m left wondering if there’s still a bug in the code or this is a cruel lesson in the confusing vagaries of trained machine learning systems. I looked more closely and the recommendations I expect are there at the top, just not the very top. Still, it’s a disappointing end to the homework problem.

No more programming!

And a disappointing end to the course’s programming assignments. That’s the last one. There’s two more weeks of lectures and short quizzes, but no more coding. I’m sort of glad because hey! no homework! But also sad because I was hoping I’d learn more from the coding assignments.

I’ll say it one last time; I’d love a class that focussed more on application of machine learning systems. I really don’t need to implement the code for neural network back propagation or the fiddly details of collaborative learning. Or rather if I’m going to be doing that, I should be doing it on a graduate level with real math as the foundation. I’m fine with an introductory course being more shallow, but then I wish the programming exercises skipped the half-assed algorithm implementation and focussed on applications, on coaxing good results from these algorithms.

I realize I’m being cranky and truly I’m grateful that the course exists. I certainly appreciate that it’s free! Mostly my criticism is intended as “how could this be better?”.