First off, it was a good use of my time. I complained about aspects of the course and I definitely think it can be improved. But it was a good introduction to what machine learning is and enough hands-on that I feel like I can no go do my own work with real tools. I also mostly liked the Coursera structure, having weekly deadlines and assignments was the structure I needed to actually learn something.
Techniques and Approaches
The thing I was most interested in learning is the general gestalt of machine learning, like how to sit down and get a problem solved. That was covered reasonably well through the course to varying degrees. Some themes:
- The general structure of “take a training set of data. train a model on it. evaluate the mode against your test set. apply the model”. In some sense all the machine learning systems do the same thing: map a vector of input data to a scalar output value. The various algorithms are just different approaches to learning how to map input data to outputs. I didn’t understand they all fit into this category before.
- The specific way that the training step is basically just running a function minimizer over your cost function (ie: measure of error). The subtlety here is most minimizers require some understanding of the gradient as well, which means you need to take partial derivatives of your cost function. (I wonder if numeric approximations to the derivative are useful in practice?)
- The concept of high bias vs high variance models, ie: overfitting the data vs. underfitting the data. Basically if your choice of algorithm has a lot of internal knobs to twiddle, it’s likely you have high variance and might overfit the data, particularly if you don’t have a big data sample. Alternately if your chosen model is too simple the system might not ever predict the data well because it lacks the expressive power. I particularly liked the way you can detect which problem your system has with measuring the learning curve, how well your system learns based on how much data you give it.
- Precision, recall, and f-score as a way to evaluate the success of a model.
- A principled way to normalize data so that all your numbers are on comparable scales of roughly -1 to 1, with mean 0.
- Regularized learning algorithms, with an extra term in the cost function to discourage overfitting.
- Machine learning pipelines. The way you can segment a broad problem like photo OCR or automatic driving into smaller machine learning problems. Also ceiling analysis to figure out which part of your pipeline could be improved the most.
- Stochastic machine learning. For very large datasets just iteratively learn on subsets of the data. Naturally leads to online learning, systems that continually learn in response to new data.
Most of the class was a tour through machine learning algorithms. Too much time implementing those algorithms for my tastes, but at least there’s no mystery to many of them now. Things we learned:
Supervised learning (ie: your training data is classified with expected outputs you are trying to predict.)
- Linear regression: fit a linear model to the data. Predict a single number from a vector of numerical inputs.
- Logistic regression: fit a logistic model to the data. Predict a single binary classifier from a vector of numerical inputs. (Really it outputs a probability!) Can predict N classes with one-vs-all classes.
- Neural networks. Fit multi-stage regression to the data. Hidden layers can discover and calculate their own learned features. Good for non-linear models.
- Support Vector Machines: logistic regression with a different error function that encourages your system to make sharp distinctions when classifying data. Also pluggable kernels, which allow you to change the function applied to features being considered to get beyond linear models. I got the impression SVMs with Gaussian kernels are the right choice in practice for many problems.
Unsupervised learning (no expected output on hand)
- K-means clustering. Group your data into K natural clusters.
- Principle component analysis. Boil high dimensional data down to fewer dimensions, while measuring and minimizing the loss of meaningful data.
- Anomaly detection. Find data examples way outside the mean.
- Collaborative filtering. Learn from user preferences what features of something result in those user preferences.
Practical application of subject matter was the weakest part of the course. The primary new technical skill I learned was Octave. But that feels like an increasingly obsolete tech and I wish I were using a language I’d be using more later. OTOH it was fun programming in a matrix math language with vectorization. I think if I were reimplementing this course today with the same curriculum, I’d consider using R and a notebook environment.
There were some real applied problems like number OCR and spam classification and movie recommendations. Those were interesting to me, but the exercises tended to have us solving one piece of a system to learn the data rather than putting the whole thing together.
I keep thinking it’d be fun to design an alternate machine learning course, one that focusses on the more practical application. Start with you downloading and pre-processing data. Then create a machine learning experimentation pipeline, plugging in out-of-the-box algorithms rather than implementing the algorithms yourself. Then go back and iteratively improve your application of ML, evaluating with learning curves and test data sets and honing the system so it really works.That’s a different course, but if I were doing that I’d do it in Python with scikit-learn and IPython notebooks.
What I missed
As noted above, the main thing I missed is more practical application. But that’s OK, because I think I learned enough to teach myself the practical stuff.
I also wish the course had a broad overview of more algorithms. There are literally thousands of machine learning algorithms in use out there, and while I trust we hit the most important basic concepts I’d have loved a single week which was just a whirlwind tour. Bayesian inference, decision trees, Markov models, … so many options.
Now on to applying what I’ve learned to my own data. I already did one little exercise in PCA and clustering with Battlefield 4 data, which was a good experience. I should really try applying SVMs to something next. Maybe League of Legends match data, but it’s hard to get a large sample with the default API rate limiting.
What I really want to do is try applying this stuff to map data. Some machine learning system to improve OpenStreetMap, maybe by identifying anomalies comparing the vector map users generate to raster aerial imagery.