Week 6 of my machine learning course.

**Lectures: applying machine learning, system design**

This week’s lecture was about using machine learning algorithms well. Great series of notes, the kind of thing I’m really in this course for. The focus was on figuring out how to improve the results of your learning system and then how to evaluate the result. You have a lot of options for tweaking the model and the learning: more features, more data, different learning rates. Which should you use? Ng gave a few heuristics that are better than guessing.

The first thing is splitting data into training and test sets. Keep the evaluation test data separate so you have something to test with that the learning algorithm never saw. Good advice. He also points out that you often test several trained models against held-out data in order to pick one of them; you can’t reuse that data (the “cross-validation data”) for final evaluation either, since the winning model was picked precisely because it does well on that data. So you really want three sets: training, cross-validation, and test.
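Here is a minimal Python sketch of that three-way split, mostly as a note to myself. The function name `split_three_ways` and the 60/20/20 proportions are my own assumptions for illustration, not something fixed by the lecture notes above.

```python
import numpy as np

def split_three_ways(X, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples, then cut them into training, cross-validation, and test sets.

    The fractions are illustrative defaults (an assumption); whatever is
    left over after train and cv goes to the test set.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_frac * len(X)))
    n_cv = int(round(cv_frac * len(X)))
    train_idx = order[:n_train]
    cv_idx = order[n_train:n_train + n_cv]
    test_idx = order[n_train + n_cv:]
    return (X[train_idx], y[train_idx]), (X[cv_idx], y[cv_idx]), (X[test_idx], y[test_idx])

# Toy data, purely for demonstration: 100 examples with one feature each.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
train, cv, test = split_three_ways(X, y)
print(len(train[0]), len(cv[0]), len(test[0]))
```

The idea is that you fit on `train`, choose between candidate models using `cv`, and only touch `test` once at the very end.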

The other thing was defining “high bias” models (underfit models, too few parameters) and “high variance” models (overfit models, too many parameters). These are important concepts! Then he gives a test to figure out if your learning system is one or the other, which is basically looking at error rates. Systems that train to low error but then test badly are high variance, overfit. Systems that train to high error and test badly are high bias, underfit. I get the intuition for this and look forward to applying it in the homework.
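That error-rate test is simple enough to write down as code. This is only a rough sketch: the function `diagnose` and the idea of a single `acceptable_error` threshold are my own simplifications (assumptions), and in practice you would eyeball the numbers rather than trust a hard cutoff.

```python
def diagnose(train_error, cv_error, acceptable_error):
    """Rough bias/variance heuristic based only on two error rates.

    acceptable_error is a hypothetical threshold for "low enough";
    picking it is a judgment call for each problem.
    """
    trains_well = train_error <= acceptable_error
    tests_well = cv_error <= acceptable_error
    if trains_well and not tests_well:
        return "high variance"  # fits the training data, fails on unseen data
    if not trains_well and not tests_well:
        return "high bias"      # cannot even fit the training data
    return "no clear problem"

print(diagnose(train_error=0.02, cv_error=0.30, acceptable_error=0.05))  # high variance
print(diagnose(train_error=0.28, cv_error=0.30, acceptable_error=0.05))  # high bias
```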

Finally we moved on to ways to build systems and evaluate them for success. Really nice exhortation to build a quick and dirty system first with the simplest features and data you can get your hands on. Start working with the problem and you’ll understand it better, then you can make an informed decision about what extra data you need. Also an interesting observation about the usefulness of very large datasets. Having an enormous amount of training data means you can use more complex models with less bias, and be confident that you won’t overfit, simply because of the sheer size of that training data set.

Ng defined precision and recall for measuring an algorithm’s correctness, also boiling those two numbers down into a single F-score for ease of comparison. Useful in particular for skewed datasets where you may be classifying something that’s only True 1 in 1000 times.
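To convince myself why plain accuracy is misleading on skewed data, here is a small Python sketch. Precision is true positives over predicted positives, recall is true positives over actual positives, and the F-score here is the usual F1, 2PR/(P+R). Returning 0.0 when a ratio is undefined is my own convention (an assumption), and the function name is made up for this example.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels, where 1 is the rare positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Skewed data: 1 positive in 1000. "Always predict 0" looks great on accuracy.
y_true = [1] + [0] * 999
always_zero = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
print(accuracy)                                  # 0.999
print(precision_recall_f1(y_true, always_zero))  # (0.0, 0.0, 0.0)
```

The do-nothing classifier is 99.9% accurate but has zero recall and a zero F-score, which is exactly the failure the single number is meant to expose.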

**Homework**

This week’s homework has us loop back and implement the very first model we learned, linear regression, then test and evaluate it according to some of what we learned. I didn’t care much for the linear regression part, but the hands-on application of learning systems and error measurements was great.

The first half of the homework was “implement linear regression learning with all the bells and whistles”. We’d not done that before, only logistic regression, so it was similar and yet different. Not too challenging really, a bit repetitive from previous work, but reinforcement is good.

The second half of the homework was “train a bunch of learning models and measure their error”. Specifically, comparing the error on the training set vs. the error on our reserved cross-validation set, graphing learning curves, etc. This is a sort of meta-machine learning: choosing the parameters for our machine learning system so that we get a well trained model.

We looked first at a truly linear model, trying to fit a straight line to data that’s roughly quadratic. Lots of errors. The learning curve shows us that adding more data points isn’t really going to help make this model more accurate, because it is high bias. So then we switched to an eighth order polynomial model but trained it without regularization. That fits the training data great but fails cross-validation, because it is high variance.
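A learning curve for the straight-line case can be sketched in a few lines of Python. To be clear about what is assumed: the data here is synthetic (a noisy parabola I made up, not the homework’s dataset), the fit uses NumPy’s least-squares solver rather than the gradient-based training from the homework, and the error is the course-style half mean squared error.

```python
import numpy as np

def design(x):
    """Design matrix for a straight line: a column of ones plus x."""
    return np.column_stack([np.ones(len(x)), x])

def fit_linear(x, y):
    """Unregularized least-squares fit of y ~ theta0 + theta1 * x."""
    theta, *_ = np.linalg.lstsq(design(x), y, rcond=None)
    return theta

def half_mse(theta, x, y):
    """Sum of squared residuals divided by 2m."""
    residual = design(x) @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic, roughly quadratic data (an assumption for illustration).
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 200)
y_train = x_train**2 + rng.normal(0, 0.1, 200)
x_cv = rng.uniform(-3, 3, 100)
y_cv = x_cv**2 + rng.normal(0, 0.1, 100)

sizes = [2, 5, 10, 50, 200]
train_err, cv_err = [], []
for m in sizes:
    theta = fit_linear(x_train[:m], y_train[:m])
    train_err.append(half_mse(theta, x_train[:m], y_train[:m]))  # error on what it saw
    cv_err.append(half_mse(theta, x_cv, y_cv))                   # error on held-out data

for m, tr, cv in zip(sizes, train_err, cv_err):
    print(f"m={m:4d}  train={tr:8.4f}  cv={cv:8.4f}")
```

With two points the line fits the training data exactly, but as the training set grows both errors settle at a similar, stubbornly high level. That plateau is the high-bias signature: more data won’t fix a model that is too simple.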

So finally we applied regularization in the learning model, the mechanism that pushes the learning toward low-magnitude coefficients for the polynomial terms. The strength of that push is governed by the value lambda. So train lots of models, some with low lambda and some with high lambda, and see which lambda gives the lowest cross-validation error. Lambda = 3, that’s best! At least for this problem.
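The lambda search can be sketched the same way. Again, this is a sketch under stated assumptions rather than the homework’s code: synthetic data, a closed-form regularized least-squares solve instead of iterative training, a candidate list of lambdas I picked myself, and an unpenalized intercept term. It is not expected to reproduce the homework’s answer of 3.

```python
import numpy as np

def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(A, y, lam):
    """Solve (A^T A + lam * L) theta = A^T y, where L skips the intercept."""
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0  # assumed convention: do not penalize the intercept
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

def half_mse(A, y, theta):
    """Unregularized error: squared residuals divided by 2m."""
    residual = A @ theta - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic, roughly quadratic data (an assumption for illustration).
rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, 20)
y_train = x_train**2 + rng.normal(0, 0.1, 20)
x_cv = rng.uniform(-1, 1, 20)
y_cv = x_cv**2 + rng.normal(0, 0.1, 20)

degree = 8
A_train = poly_features(x_train, degree)
A_cv = poly_features(x_cv, degree)

lambdas = [0.0, 0.001, 0.01, 0.1, 1.0, 3.0, 10.0]  # hypothetical candidates
train_errors, cv_errors = [], []
for lam in lambdas:
    theta = fit_regularized(A_train, y_train, lam)
    train_errors.append(half_mse(A_train, y_train, theta))
    cv_errors.append(half_mse(A_cv, y_cv, theta))

best_lambda = lambdas[int(np.argmin(cv_errors))]
print("best lambda by cross-validation error:", best_lambda)
```

Note that the errors being compared leave out the penalty term. Lambda only shapes the training; the comparison between models is on plain prediction error over the cross-validation set.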

**Conclusion**

I’m really itching to apply this stuff to my own problems now, to make it real. But I’ve got to solve two problems at once to do that. I need confidence in my own code without the safety net of homework problems with calculated correct answers. And I really want to redo this work in Python, to get out of the weird Octave world. Taking both leaps at once is foolish. I should either bring a problem of my own into Octave, so I can apply my proven code to new data, or else port the homework to Python first and verify I still get the right answers, to prove my new code. I tried the latter already and got frustrated, so I should try the former.