Major shift of tone in week 10 of my course. Less detailed algorithms and programming assigments, more of a “here’s some wisdom about real world use”. Well a bit of each, but no programming. Being this high level makes me feel a bit dislocated, I appreciate more now the harder assignments in previous weeks.
Stochastic Gradient Descent
Most of the lecture was on the problem of running machine learning algorithms on enormous data sets, say 100,000,000 examples. These algorithms tend to be of the form “calculate this cost function over all data. then average the cost and update theta, our model parameters”. That’s inefficient when your data is 100M examples! So instead we update theta based on only some of the example set.
Stochastic gradient descent has you doing gradient descent where you update theta once per data example. That means that instead of swiftly following the gradient down to the optimum your optimizer is now doing a bit of a drunken walk, but in the right direction towards the optimum. And with a huge amount of data you can get away with looking at all the data a little less often. Ie: a normal batch gradient descent may do 100 evaluations of all 100M data points. With stochastic you can get away with 1-10 evaluations of the 100M data points. I assume law of large numbers is why. One caveat: it’s important to iterate through your data in a random order, shuffle the data first.
“Mini batch” gradient descent is a middle ground where you calculate the cost over, say, 1000 samples before updating theta. Stochastic looks at 1 sample at a time, batch looks at all N, mini-batch looks at a subset of say 1000. The reason you do this middle ground is because you can still get some benefits of vectorization and/or parallelization.
Convergence is a bit woolly with these stochastic approaches. You still want to monitor convergence, but you really don’t want to calculate the cost function over the whole 100M data samples every time! So instead look at a sliding window of the last, say, 100 cost function calculations you did and monitor it to make sure it’s converging.
Closely related to stochastic machine learning is online learning, where your machine learning model is updating itself continuously in response to new user input. That’s how real websites work. The neat thing about this is your model will adapt to changes in the world over time. Ie: you may have trained a perfect spam classifier in 2013, but spam techniques changed. An online learning system will adapt to new techniques as new data comes in.
Parallelism and map reduce
Last lecture was a funny coda which was basically “oh yeah, you want to run machine learning in parallel on big data!” along with “there’s this thing called map reduce you should know about”. It’s obvious how many ML algorithms are embarrassingly parallel over the input data, so maybe it’s fine to go quickly over this. But a simple map reduce implementation of, say, linear regression would have been a great programming exercise. Perhaps not in Octave though, maybe that’s why they skipped it. I wonder if scikit-learn has support for map reduce? Quick search suggests not really, you roll your own on top of Hadoop.