Machine Learning study

My old colleague Karl Rosaen is taking time off from working to study Machine Learning. He’s meticulously documenting his curriculum and day-to-day work; there are good resources there.

One big part of Karl’s studying is Raschka’s book Python Machine Learning. It looks like a nicely practical book; I was looking for something like that last year and it wasn’t quite out yet. His example code is online, and his blog post about writing it is nice.

My biggest regret from my machine learning work is not applying it to real problems. Karl mentions that the Kaggle competitions are a good set of problems and datasets to work on.

Machine Learning: reflections on the Coursera course

I’m all done with my Machine Learning course on Coursera, and I wrote a lot about it as I was taking it. Time to look back and put it together.

First off, it was a good use of my time. I complained about aspects of the course and I definitely think it can be improved. But it was a good introduction to what machine learning is and enough hands-on that I feel like I can now go do my own work with real tools. I also mostly liked the Coursera structure; having weekly deadlines and assignments was the discipline I needed to actually learn something.

Techniques and Approaches

The thing I was most interested in learning was the general gestalt of machine learning: how to sit down and get a problem solved. That was covered reasonably well through the course, though to varying degrees. Some themes:

  • The general structure of “take a training set of data. train a model on it. evaluate the model against your test set. apply the model”. In some sense all the machine learning systems do the same thing: map a vector of input data to a scalar output value. The various algorithms are just different approaches to learning how to map input data to outputs. I didn’t understand that they all fit this pattern before. (A minimal sketch of this workflow, in Python, follows the list.)
  • The specific way that the training step is basically just running a function minimizer over your cost function (ie: measure of error). The subtlety here is that most minimizers also require some understanding of the gradient, which means you need to take partial derivatives of your cost function. (I wonder if numeric approximations to the derivative are useful in practice?)
  • The concept of high bias vs high variance models, ie: underfitting the data vs. overfitting the data. Basically if your choice of algorithm has a lot of internal knobs to twiddle, it’s likely you have high variance and might overfit the data, particularly if you don’t have a big data sample. Alternatively, if your chosen model is too simple the system might never predict the data well because it lacks the expressive power. I particularly liked the way you can detect which problem your system has by measuring the learning curve: how well your system learns as a function of how much data you give it.
  • Precision, recall, and f-score as a way to evaluate the success of a model.
  • A principled way to normalize data so that all your numbers are on comparable scales of roughly -1 to 1, with mean 0.
  • Regularized learning algorithms, with an extra term in the cost function to discourage overfitting.
  • Machine learning pipelines. The way you can segment a broad problem like photo OCR or automatic driving into smaller machine learning problems. Also ceiling analysis to figure out which part of your pipeline could be improved the most.
  • Stochastic machine learning. For very large datasets just iteratively learn on subsets of the data. Naturally leads to online learning, systems that continually learn in response to new data.
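
For my own notes, here’s a minimal sketch of that overall shape in Python with scikit-learn rather than the course’s Octave; the synthetic dataset and every parameter choice are just placeholders for illustration.

```python
# Sketch of the generic workflow: normalize features, train a model,
# evaluate it against held-out test data with precision/recall/F-score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Synthetic stand-in data: a vector of inputs mapping to a 0/1 output.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set the learner never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Normalize so every feature is on a comparable scale with mean 0.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Train a (regularized) model, then evaluate it on the test set.
model = LogisticRegression(C=1.0).fit(X_train, y_train)
pred = model.predict(X_test)
print(precision_score(y_test, pred), recall_score(y_test, pred), f1_score(y_test, pred))
```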

Algorithms

Most of the class was a tour through machine learning algorithms. Too much time implementing those algorithms for my taste, but at least there’s no mystery to many of them now. Things we learned (with a rough scikit-learn mapping sketched after the two lists below):

Supervised learning (ie: your training data is labeled with the expected outputs you are trying to predict)

  • Linear regression: fit a linear model to the data. Predict a single number from a vector of numerical inputs.
  • Logistic regression: fit a logistic model to the data. Predict a single binary class from a vector of numerical inputs. (Really it outputs a probability!) Can predict N classes with one-vs-all classifiers.
  • Neural networks. Fit multi-stage regression to the data. Hidden layers can discover and calculate their own learned features. Good for non-linear models.
  • Support Vector Machines: logistic regression with a different error function that encourages your system to make sharp distinctions when classifying data. Also pluggable kernels, which allow you to change the function applied to features being considered to get beyond linear models. I got the impression SVMs with Gaussian kernels are the right choice in practice for many problems.

Unsupervised learning (no expected output on hand)

  • K-means clustering. Group your data into K natural clusters.
  • Principal component analysis. Boil high dimensional data down to fewer dimensions, while measuring and minimizing the loss of meaningful data.
  • Anomaly detection. Find data examples way outside the mean.
  • Collaborative filtering. Learn from user preferences what features of something result in those user preferences.
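
Here’s my own rough cheat sheet mapping that catalog onto scikit-learn estimators; these are the closest equivalents I know of, not something from the course.

```python
# Rough mapping from the course's algorithm catalog to scikit-learn classes.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neural_network import MLPClassifier      # neural network
from sklearn.svm import SVC                           # support vector machine
from sklearn.cluster import KMeans                    # k-means clustering
from sklearn.decomposition import PCA                 # principal component analysis
from sklearn.covariance import EllipticEnvelope       # Gaussian-style anomaly detection

supervised = {
    "linear regression": LinearRegression(),
    "logistic regression (one-vs-all for N classes)": LogisticRegression(),
    "neural network": MLPClassifier(hidden_layer_sizes=(25,)),
    "SVM with Gaussian kernel": SVC(kernel="rbf"),
}
unsupervised = {
    "k-means clustering": KMeans(n_clusters=3),
    "principal component analysis": PCA(n_components=2),
    "anomaly detection": EllipticEnvelope(),
}
# Collaborative filtering has no off-the-shelf scikit-learn estimator;
# you roll your own cost function or use a dedicated recommender library.
```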

Practical skills

Practical application of the subject matter was the weakest part of the course. The primary new technical skill I learned was Octave. But that feels like increasingly obsolete tech, and I wish I’d been working in a language I’ll keep using later. OTOH it was fun programming in a matrix math language with vectorization. I think if I were reimplementing this course today with the same curriculum, I’d consider using R and a notebook environment.

There were some real applied problems like number OCR and spam classification and movie recommendations. Those were interesting to me, but the exercises tended to have us solving one piece of a system that learns from the data rather than putting the whole thing together.

I keep thinking it’d be fun to design an alternate machine learning course, one that focusses on practical application. Start with downloading and pre-processing data. Then create a machine learning experimentation pipeline, plugging in out-of-the-box algorithms rather than implementing the algorithms yourself. Then go back and iteratively improve your application of ML, evaluating with learning curves and test data sets and honing the system so it really works. That’s a different course, but if I were doing it I’d do it in Python with scikit-learn and IPython notebooks.
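
To make the idea concrete, here’s a rough sketch of what one exercise in that imagined course might look like; the CSV filename and column name are made up, and the SVM is just an example of an off-the-shelf algorithm plugged into a pipeline.

```python
# Sketch of an "applied ML" exercise: pre-process your own data, plug an
# off-the-shelf algorithm into a pipeline, evaluate, iterate.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

df = pd.read_csv("my_dataset.csv")            # placeholder: your own downloaded data
X, y = df.drop(columns=["label"]), df["label"]

pipeline = Pipeline([
    ("scale", StandardScaler()),              # normalization as a pipeline step
    ("model", SVC(kernel="rbf", C=1.0)),      # swap in any out-of-the-box algorithm here
])

# Evaluate, tweak the pipeline, evaluate again.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print(scores.mean())
```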

What I missed

As noted above, the main thing I missed is more practical application. But that’s OK, because I think I learned enough to teach myself the practical stuff.

I also wish the course had included a broader overview of more algorithms. There are thousands of machine learning algorithms in use out there, and while I trust we hit the most important basic concepts, I’d have loved a single week that was just a whirlwind tour. Bayesian inference, decision trees, Markov models, … so many options.

Next steps

Now on to applying what I’ve learned to my own data. I already did one little exercise in PCA and clustering with Battlefield 4 data, which was a good experience. I should really try applying SVMs to something next. Maybe League of Legends match data, but it’s hard to get a large sample with the default API rate limiting.

What I really want to do is try applying this stuff to map data. Some machine learning system to improve OpenStreetMap, maybe by identifying anomalies comparing the vector map users generate to raster aerial imagery.

Machine Learning: Photo OCR, real world problem

These last two weeks of the course were so short I finished them both in one day.

I liked this last week, week 11. Most of the lectures were about a real world machine learning problem: finding and OCRing text in photographs.

Photo OCR

The key idea here is the pipeline. He broke the problem up into 4 separate machine learning problems that can be worked on independently.

  1. Find regions of the image that look like text.
  2. Segment text regions into individual characters.
  3. OCR the individual characters.
  4. Spell-correct the OCR errors.

The real meat of the lesson was the sliding window algorithm, the technique by which you identify text regions. Basically you look at small square regions of the image and train a recognizer to identify text/not-text. Then you merge adjacent regions of text into single blobs for handing off to step 2, the segmenter. The segmenter also uses a sliding window analysis to find whitespace between characters, albeit with a 1D window.
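
A toy sketch of the sliding window idea as I understand it; the window size, stride, and the classifier being passed in are all placeholders, not anything from the course code.

```python
import numpy as np

def sliding_window_text_mask(image, is_text, win=32, stride=8):
    """Run a text/not-text classifier over every small square window of a
    grayscale image and return a boolean mask of regions flagged as text."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            patch = image[top:top + win, left:left + win]
            if is_text(patch):                               # text / not-text decision
                mask[top:top + win, left:left + win] = True  # adjacent hits merge into blobs
    return mask

# The segmenter is the same idea in 1D: slide a narrow window across a text
# blob and look for columns of whitespace to split it into characters.
```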

Lessons for machine learning

Second half of the lesson was reflections on doing better for machine learning.

First question: do you need more data? It may be expensive to get it. More data is most useful for low bias machine learning algorithms, so don’t bother if you have an underfit (high bias) model. (Or rather, loosen your model first!) He also talked about artificial data synthesis, ways to generate more data from existing labelled training sets. For instance if you’re doing voice recognition, maybe you can generate more training examples by adding real-world distortion to existing examples. However don’t just add random noise, since that doesn’t really train anything useful.

Second question: what should I work on next? The key idea here is ceiling analysis. Basically you go down your pipeline replacing each step with a perfect system. Ie, replace step 1 with a perfect text region classifier; how much does your system improve? Now also replace step 2 with a perfect segmenter: how much better does your system do? With that experiment in hand you can identify which steps in the pipeline have the most room for improvement and are worth your time to work on.
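
A back-of-the-envelope sketch of how I understand ceiling analysis; the pipeline stages are the photo OCR ones, but the accuracy numbers are invented for illustration.

```python
# Ceiling analysis: cumulatively replace each pipeline stage with perfect
# ground-truth output and see how much end-to-end accuracy improves.
baseline_accuracy = 0.72          # invented number: the current full pipeline
with_perfect_stage = {            # invented numbers: accuracy with stages 1..k made perfect
    "text detection": 0.89,
    "segmentation":   0.90,
    "character OCR":  1.00,
}

previous = baseline_accuracy
for stage, accuracy in with_perfect_stage.items():
    print(f"{stage}: potential gain {accuracy - previous:.2f}")
    previous = accuracy
# Here the biggest jump comes from perfecting text detection, so that's the
# stage with the most room for improvement.
```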

Once again I wish there were a programming exercise here to hammer these lessons home. Have us construct our own pipeline and do some testing / segmentation on it. I think this would be a good time to revisit the ideas of training vs. test sets, applying the test set well. Also evaluating systems for bias/variance problems. I suspect the Stanford undergrad class had students doing some final project which was just left out of the Coursera class, perhaps because it would be impossible to grade. Guess it’s up to me to apply this stuff to real problems and make my own mistakes without a teacher to tell me how I’m doing. Unsupervised learning, as it were.

And that’s the course! I’ve got one more blog post coming summarizing what we covered and what I thought. Overall I’m quite positive; it was a good use of my time.

Machine Learning: large scale learning

Major shift of tone in week 10 of my course. Less detailed algorithms and programming assignments, more of a “here’s some wisdom about real world use”. Well, a bit of each, but no programming. Being this high level makes me feel a bit dislocated; I appreciate the harder assignments in previous weeks more now.

Stochastic Gradient Descent

Most of the lecture was on the problem of running machine learning algorithms on enormous data sets, say 100,000,000 examples. These algorithms tend to be of the form “calculate this cost function over all data. then average the cost and update theta, our model parameters”. That’s inefficient when your data is 100M examples! So instead we update theta based on only some of the example set.

Stochastic gradient descent has you doing gradient descent where you update theta once per data example. That means that instead of swiftly following the gradient down to the optimum your optimizer is now doing a bit of a drunken walk, but in the right direction towards the optimum. And with a huge amount of data you can get away with looking at all the data a little less often. Ie: a normal batch gradient descent may do 100 evaluations of all 100M data points. With stochastic you can get away with 1–10 passes over the 100M data points. I assume the law of large numbers is why. One caveat: it’s important to iterate through your data in a random order, so shuffle the data first.

“Mini batch” gradient descent is a middle ground where you calculate the cost over, say, 1000 samples before updating theta. Stochastic looks at 1 sample at a time, batch looks at all N, mini-batch looks at a subset of, say, 1000. The reason for this middle ground is that you can still get some of the benefits of vectorization and/or parallelization.

Convergence is a bit woolly with these stochastic approaches. You still want to monitor convergence, but you really don’t want to calculate the cost function over the whole 100M data samples every time! So instead look at a sliding window of the last, say, 100 cost function calculations you did and monitor it to make sure it’s converging.
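
Here’s a rough NumPy sketch of the idea, written for linear regression (my own reconstruction, not the course’s code): set batch_size to 1 and it’s stochastic gradient descent, set it to len(X) and it’s ordinary batch gradient descent, and the cost is monitored over a sliding window of recent batches rather than the whole dataset.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=1000, epochs=5):
    """Linear regression trained with mini-batch gradient descent.
    batch_size=1 gives stochastic gradient descent; batch_size=len(X) gives
    plain batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs = []
    for _ in range(epochs):
        order = np.random.permutation(m)          # shuffle: visit examples in random order
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ theta - y[idx]
            theta -= alpha * (X[idx].T @ error) / len(idx)   # update theta from this batch only
            recent_costs.append((error ** 2).mean() / 2)
            if len(recent_costs) == 100:          # convergence check on a window of recent costs
                print("avg cost over last 100 batches:", np.mean(recent_costs))
                recent_costs = []
    return theta
```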

Online Learning

Closely related to stochastic machine learning is online learning, where your machine learning model is updating itself continuously in response to new user input. That’s how real websites work. The neat thing about this is your model will adapt to changes in the world over time. Ie: you may have trained a perfect spam classifier in 2013, but spam techniques changed. An online learning system will adapt to new techniques as new data comes in.

Parallelism and map reduce

Last lecture was a funny coda which was basically “oh yeah, you want to run machine learning in parallel on big data!” along with “there’s this thing called map reduce you should know about”. It’s obvious how many ML algorithms are embarrassingly parallel over the input data, so maybe it’s fine to go quickly over this. But a simple map reduce implementation of, say, linear regression would have been a great programming exercise. Perhaps not in Octave though; maybe that’s why they skipped it. I wonder if scikit-learn has support for map reduce? A quick search suggests not really; you roll your own on top of Hadoop.
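
Since the course skipped it, here’s my guess at what that exercise might look like. The gradient of the linear regression cost is a sum over examples, so the “map” step computes a partial gradient on each shard of data and the “reduce” step adds them up; plain Python and made-up toy data stand in for a real Hadoop/Spark job.

```python
import numpy as np

def map_partial_gradient(X_shard, y_shard, theta):
    """Map step: each worker computes the gradient contribution of its shard."""
    error = X_shard @ theta - y_shard
    return X_shard.T @ error

def gradient_step(shards, theta, alpha, m):
    """Reduce step: sum the per-shard gradients, then take one descent step."""
    total_gradient = sum(map_partial_gradient(Xs, ys, theta) for Xs, ys in shards)
    return theta - alpha * total_gradient / m

# Toy usage: pretend the 4 shards live on 4 machines.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=1000)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

theta = np.zeros(3)
for _ in range(200):
    theta = gradient_step(shards, theta, alpha=0.1, m=len(X))
print(theta)    # should land close to [1, 2, -3]
```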

My first IPython notebook

I finally sat down and learned some IPython for data exploration today. I re-examined some data I had for Battlefield 4 players, something I’d published as my BF4Plots visualization a year or two ago. It’s the stats on ~5000 top Battlefield 4 players, things like their win/loss ratio, skill score, etc.

You can see my notebook in this gist. That’s a static snapshot taken from my last run. If you run it yourself the results will change slightly as some of the algorithms I apply include randomization.

My main goal here was to get more comfortable using IPython. It’s awesome! I really love having the inline, easy to produce graphs. I mostly followed this tutorial for guidance.

Secondarily I was also trying to apply some cluster analysis to my BF4 data to see if I could learn any insights. Not really. This is my first foray into my own machine learning project applying stuff I’ve learned in my course. And it was interesting; I definitely felt I understood what the scikit algorithms were doing. I was also a little lost on how to get real meaning out of the data. No surprise, that’s the hard part!

The main IPython thing I’m still adjusting to is that it’s a notebook, not a program. I’m so used to iteratively working and re-running my program from scratch. But IPython encourages you to keep a persistent Python VM around and iteratively add to its state. I keep looking for the “Start over and run it all from scratch” button, and there sort of is one, but that’s not really how you’re supposed to work. Which got me into trouble a couple of times. Also I do wonder how IPython people transition their code to running programs for other people to reuse.

Update: a “restart and run all” button was just merged into the main codebase so that’s coming soon. And apparently I’m right that notebooks are more for exploration. When it comes time to create reusable code you create a normal module with some other code editing environment. You can certainly start by pasting your notebook-developed code over, though.

First Python library I learned was Pandas, the data management library. Really all I did was work a bit with DataFrame, which is a data container type that’s a 2d array on steroids. Columns are typed, everything’s nicely addressable and searchable, and it’s all efficient numeric arrays behind the scenes. I think it even supports multicore number crunching in a transparent fashion. Really nice library, a joy to use.

Second library I learned was matplotlib, some basic bashing at it to draw some graphs. It’s really great and I wish I’d invested the time to learn it before. But until I saw the HTML notebooks it wasn’t very compelling to me. It’s funny: Mathematica notebooks have been a thing for 20+ years, but it’s only in the last couple of years that Python caught up.

Final thing I learned was a bit of scikit-learn, in particular how to apply its PCA and K-Means algorithms. Went pretty smoothly. My only annoyance is that it seems to only half-support Pandas. It will take DataFrames as inputs, but then the objects it returns are basic NumPy arrays which lack column names and some of the nice slicing options. I can’t tell if I’m doing it wrong, if it has to be this way, or if it’s just that Pandas is new enough that scikit-learn hasn’t fully adopted it yet.
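
For future reference, the rough shape of what I was doing; the CSV filename and column names here are placeholders for the real BF4 stats, and wrapping the results back into a DataFrame is my own workaround, not an official scikit-learn idiom.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("bf4_players.csv")   # placeholder filename
features = df[["win_loss_ratio", "skill", "kills_per_minute", "score_per_minute"]]

scaled = StandardScaler().fit_transform(features)         # comes back as a plain NumPy array
components = PCA(n_components=2).fit_transform(scaled)    # also a NumPy array, no column names
clusters = KMeans(n_clusters=4, random_state=0).fit_predict(scaled)

# Workaround for the half-support annoyance: wrap the results back into a DataFrame.
result = df.assign(pc1=components[:, 0], pc2=components[:, 1], cluster=clusters)
result.plot.scatter(x="pc1", y="pc2", c="cluster", colormap="viridis")
```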

Anyway, all in all my experiment was a big success, very happy. Fun stuff!

Machine Learning: anomaly detection, recommender systems

Week 9 of 11, and the last week with coding homework!

Anomaly Detection

The first unit of lectures was on anomaly detection: picking outliers out of a dataset. Very simple compared to previous algorithms we’ve studied. In the one-feature case it boils down to “calculate the mean and standard deviation of the input set. Values way outside the mean are anomalies”. He does define it carefully in terms of Gaussian probability distributions, which is at least the solid way to do this.

So what about multi-featured data? Instead of doing proper multivariate Gaussians, the course focusses on handling vector input data by fitting a single-variable Gaussian separately to each input feature, then calculating the product of all the probabilities to label the full vector. The result is a special case of the general multivariate Gaussian; in particular it’s a case that only really makes sense if you believe there are no correlations between your input features. If you do it this limited way you’re stuck having to inspect the data to define new synthetic features, like “height times width”, based on domain knowledge that the area of something is more relevant than the height or the width itself.

There are optional lectures on multivariate Gaussians but I’m left wondering why the course just doesn’t start there. Maybe to simplify the math (ha!) or maybe because the special case is more efficient to calculate. He notes the multivariate case doesn’t work well if you have more features than data samples. But if that’s the case, you probably shouldn’t be trying some fancy statistics on your meagre dataset anyway. (The irony is the homework code actually does use the multivariate Gaussian after all.)

The artful step of applying anomaly detection is determining the threshold for what an anomaly is. Ie: if the datapoint has a probability of 0.9 it’s clearly good. 0.1? Probably. But 0.01? Or 0.001? Or 0.0001? How improbable does it have to be before you decide it’s an anomaly? The course has us iteratively learning a good epsilon threshold for discrimination, using a classified test set. But we have the standard deviation of the good data available to us; surely it’d be better to talk about “5 standard deviations out” as a threshold? Or maybe that’s implicit in the Gaussian probability calculation?
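
Here’s a rough NumPy sketch of the per-feature Gaussian approach plus picking epsilon by F1 score on a labelled validation set, as I understand it from the lectures; this is my own reconstruction, not the homework code.

```python
import numpy as np
from sklearn.metrics import f1_score

def fit_gaussians(X):
    """Per-feature Gaussian: just the mean and variance of each column."""
    return X.mean(axis=0), X.var(axis=0)

def probability(X, mu, var):
    """Product of the independent per-feature Gaussian densities."""
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

def choose_epsilon(p_val, y_val):
    """Scan candidate thresholds and keep the one with the best F1 score
    against a labelled validation set (y_val: 1 = anomaly, 0 = normal)."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), 1000):
        f1 = f1_score(y_val, (p_val < eps).astype(int))
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps

# Usage: flag anything whose probability falls below the learned epsilon.
# mu, var = fit_gaussians(X_train)
# eps = choose_epsilon(probability(X_val, mu, var), y_val)
# anomalies = probability(X_new, mu, var) < eps
```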

The programming homework was awfully simple. The solution for the first part of the first question is literally “mu = mean(X);”. I have come to appreciate how nice Octave is about writing vectorized code for computing with vector and matrix data. So much nicer than writing loops!

Recommender systems

Second unit was on recommender systems, specifically collaborative filtering. This is a serious throwback for me; when I started grad school in 1996 my group was all excited because some previous students had just gone off to found Firefly, a company to do collaborative filtering recommendations. They did sort of OK and finally sold to Microsoft 6 years later for a rumored $40M. IIRC the machine learning part wasn’t the real value of the company in the end, but rather some engineering details of how they built it.

Anyway, collaborative filtering is a neat algorithm because it lets a machine learning system really learn something. You give it a big batch of user ratings of, say, movies. The system divines hidden features for those movies that best predict why people have preferences for certain movies. (Ie: maybe it figures out there’s a grouping which is roughly “action movies”). Then you can use these learned features to go back and predict ratings you haven’t seen before.

I particularly like the way Ng presented the algorithm first as an iterative problem. “Use ratings to predict features. Then use those features to predict ratings. Then use those to improve your feature predictions, …”. But in the end it all gets collapsed into a single optimization problem.

OTOH I still hate the math and notation presented in the lectures; it’s just impenetrable. I understand the gist of what it’s doing, but what is that notation supposed to convey? My desire now is that the course not be taught with math notation but with Octave code notation instead, so that all those ugly sums and stuff are replaced with simple vector math code samples.
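
To show what I mean, here’s the collaborative filtering cost and gradients as I understand them, written as a few lines of vectorized code (NumPy rather than Octave, since that’s where I’m headed); Y is the ratings matrix, R the 0/1 “was rated” mask, X the movie features, and Theta the per-user parameters.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost, vectorized.
    X: (num_movies, num_features), Theta: (num_users, num_features),
    Y: (num_movies, num_users) ratings, R: same shape, 1 where a rating exists."""
    errors = (X @ Theta.T - Y) * R          # only count entries that were actually rated
    cost = 0.5 * np.sum(errors ** 2)
    cost += (lam / 2) * (np.sum(Theta ** 2) + np.sum(X ** 2))   # regularize both parameter sets
    return cost

def cofi_gradients(X, Theta, Y, R, lam):
    """Gradients of the same cost with respect to X and Theta, also vectorized."""
    errors = (X @ Theta.T - Y) * R
    X_grad = errors @ Theta + lam * X
    Theta_grad = errors.T @ X + lam * Theta
    return X_grad, Theta_grad
```

To my eye that conveys far more than the double-indexed sums on the slides.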

One thing the lecture didn’t cover is how you decide how many features to let the system learn. I believe this choice is arbitrary but probably essential to good performance. Someone else already asked in the course discussion. The mentor answered saying basically “try the smallest number that works well on test data”.

I was dreading the homework, since it’s yet again a “translate the confusing math to code” exercise. But having done this kind of thing a few times, and with the mentor’s course notes as a guide, it’s pretty straightforward. Particularly if you just do a vectorized implementation instead of the for loops the lecture definitions and exercises suggest. Seriously, I think this course got its approach wrong; it’s insane to try to do the calculations unvectorized in a language like Matlab, both for clarity and performance. The lectures should teach the vectorized approach.

The exercises build up to an actual movie recommender system (using MovieLens data). Neat! Only the homework is flubbed. With the originally supplied code we tell the recommender we liked Alien and Die Hard 2 and then it recommends Star Wars and Braveheart. Makes sense, right? Only the predicted rating I will give these films is 8.9 on a scale of 1–5. Oops. But at least the list of recommended movies is plausible.

Turns out the supplied code forgot to use some normalized values. So I fix that code as instructed and run it again, and now the highest predicted rating is 5. Good! Only now the list of recommended films is completely obscure movies like “Santa with Muscles” and “Aiqing wansui”. I’m left wondering if there’s still a bug in the code or if this is a cruel lesson in the confusing vagaries of trained machine learning systems. I looked more closely and the recommendations I expect are there near the top, just not at the very top. Still, it’s a disappointing end to the homework problem.

No more programming!

And a disappointing end to the course’s programming assignments. That’s the last one. There’s two more weeks of lectures and short quizzes, but no more coding. I’m sort of glad because hey! no homework! But also sad because I was hoping I’d learn more from the coding assignments.

I’ll say it one last time; I’d love a class that focussed more on application of machine learning systems. I really don’t need to implement the code for neural network back propagation or the fiddly details of collaborative learning. Or rather if I’m going to be doing that, I should be doing it on a graduate level with real math as the foundation. I’m fine with an introductory course being more shallow, but then I wish the programming exercises skipped the half-assed algorithm implementation and focussed on applications, on coaxing good results from these algorithms.

I realize I’m being cranky and truly I’m grateful that the course exists. I certainly appreciate that it’s free! Mostly my criticism is intended as “how could this be better?”.

Machine Learning: k-means clustering, principal component analysis

This week’s lectures were our first foray into what is bizarrely called “unsupervised machine learning”. All our previous algorithms were given data features and correct classifications as inputs. This time around all we get are data features; these are algorithms that try to classify / understand the data without guidance. It all seemed much simpler than previous weeks, particularly the assignments; I don’t know if I’m getting smarter or the material is just easier. I vaguely remember enough math from college that this all seemed pretty straightforward.

K-Means Clustering

K-means clustering is a way to classify a bunch of inputs into a relatively small set of “K” clusters. It’s easy to illustrate visually:

[Screenshot from the lecture: a 2-D scatter of data points split into two clusters, red and blue, with the cluster centroids marked as Xs.]

The dots represent data input points, 2 dimensions of input data. Visually we can see these naturally fall into two clusters, helpfully colored here red and blue. Note the input data to this algorithm is not labelled; we don’t know what’s red and blue. The point of k-means clustering is to discover those red/blue labels. Once k-means has converged the model has two points, here marked as Xs: the centroids of the two clusters. The way the algorithm works is you pick points at random to be your cluster centroids, then iteratively optimize, updating the centroids so as to minimize the distance between every point in a cluster and its centroid.
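
The whole algorithm fits in a few lines; here’s a rough NumPy sketch, with random initialization, a fixed number of iterations, and no handling of empty clusters or restarts.

```python
import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its assigned points, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random points as initial centroids
    for _ in range(iterations):
        # distance from every point to every centroid, then nearest-centroid labels
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```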

Intuitively this is a pretty simple algorithm, and it is effective. Where people get tripped up is that our intuitions about “distance” and “near” from 2 or 3 dimensional spaces don’t really work for the 1000+ dimensional data we often study. Ng didn’t go over that; it’s just wisdom I’ve picked up from friends.

One thing I hadn’t known is that k-means is useful even when the data isn’t strictly clustered. For instance say you have height/weight of people, a continuous smear of points. Now say you want to sell t-shirts in three sizes: Small, Medium, Large. So cluster your data into three clusters and the centroids are the mean person you should make that size for. Kinda neat.

The art of applying k-means is knowing how to pick K, the number of clusters. He presented the “elbow method”, which basically has you keep making K bigger until you stop getting a significant improvement in the error.

Principal Component Analysis

PCA is an algorithm to project a high dimensional dataset down to a lower dimensional dataset. You can use this to project 10,000 dimensional input data down to, say, 1000 dimensions so that then your learning algorithms run faster because there’s less data. Or you can try to project data down to 2 or 3 dimensions for visualization.

The actual PCA algorithm basically boils down to “calculate eigenvectors”. Which Ng sort of said was the case, while not assuming the student knows what an eigenvector is. It was a pretty clear explanation really.

You still have to choose how many dimensions you want in your reduced dataset. His heuristic is “you want 99 percent of the variance retained”, which means “accept up to 1% error”. He says in practice you can often reduce data to 1/5 to 1/10 the number of dimensions and stay within this threshold.
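
That heuristic maps directly onto scikit-learn, where (if I have it right) you can pass the fraction of variance to retain instead of a fixed number of components; the data here is synthetic, standing in for a real high-dimensional dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for a high-dimensional dataset with correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50)) @ rng.normal(size=(50, 50))

X_scaled = StandardScaler().fit_transform(X)

# Ask for 99% of the variance retained; PCA picks the number of components.
pca = PCA(n_components=0.99).fit(X_scaled)
print(pca.n_components_, "components retain",
      round(pca.explained_variance_ratio_.sum(), 3), "of the variance")

X_reduced = pca.transform(X_scaled)
```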

Homework

Our actual homework this week was really simple, basically writing one or two lines of vector math code at a time to implement the pieces of the k-means and PCA algorithms. Not much learning. There were a bunch of optional extra homework assignments applying them to real world data. Using k-means clustering to reduce an image palette, for instance (find the 16 most important colors). Or using PCA to reduce pixel data from pictures to the important components for facial recognition. That was neat stuff; I wish we’d done that for homework.

Once again I find myself wanting to do this stuff in Python. PCA would be nice to apply to that Battlefield 4 dataset I collected; it’s ~20 dimensional data, boil it down to the 4 or 5 eigenvectors that really define people’s performance. Then maybe cluster that to segment the population.

Machine Learning: Support Vector Machines

After weeks of complaining that my machine learning course had too much detail on the math and guts of ML algorithms and not enough on applying ML to real data, I got my wish. This week’s unit on Support Vector Machines (SVMs) was pretty high level and breezy, and the homework even more so. And now I’m dissatisfied; I feel like I didn’t get my week’s worth of learning! Not that I’m complaining about an easy week, but I wish there were a bit more brain-bending around application to replace all the brain-bending I was doing before on figuring out the vector math and programming.

PS: I’ve created a Machine Learning category on the blog. And mucked with the theme. Boy, wordpress.com is super buggy and badly product managed.

Lectures

The main topic this week was Support Vector Machines, a supervised learning technique that was framed as being more powerful and practical in application than the previous techniques we used like linear regression, logistic regression, and neural networks.

Conceptually it works a lot like these other supervised learning systems. You define a cost function, but unlike the logistic cost function this one biases the system to prefer “large margins”. Ie basically it’s not enough to say “49% chance it’s category A”; the training is encouraged to say “10% chance it’s category A” instead. You control this bias/overfit tradeoff with the parameters C and sigma. C is much like our lambda from before, a damping factor to encourage small parameter values. (Confusingly, C = 1/lambda.) Sigma shapes the Gaussian kernel itself: how quickly similarity falls off with distance.

Oh yes, kernels. This is a neat trick where you can modify your input features with various similarity functions. A linear kernel gives you a plain linear classifier, more or less like logistic regression. The Gaussian kernel is a good general purpose kernel. There’s a bunch of other kernels people use. And some magic allows this pluggable kernel function to be applied efficiently, so the training system runs quickly.
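
In scikit-learn terms, which is where I’d actually use this, the same knobs show up as C and gamma on SVC, with gamma playing the role of sigma (roughly gamma = 1/(2·sigma²) for the Gaussian kernel, if I have the correspondence right). A minimal sketch on a toy non-linear problem:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # a toy non-linear problem
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear kernel: a plain linear decision boundary.
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Gaussian (RBF) kernel: gamma corresponds to 1/(2*sigma^2) in the course's notation.
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)

print("linear:", linear_svm.score(X_test, y_test))
print("rbf:   ", rbf_svm.score(X_test, y_test))
```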

That’s where the lectures got a bit hazy. There was one whole 20 minute video dedicated to the math which was prefaced with “this is optional, and you should never try to implement this; use libsvm instead”. The main thing I learned from this is you can still understand speech when watching a video played back at 2.2x speed, even though the individual phonemes are no longer distinguishable. He skimmed over the math quite a bit, nothing was lost by ignoring it.

I never did quite understand why SVMs are better. We were advised they are best in applications with lots of training examples relative to the size of the feature set. Logistic regression is better if you have lots of features and few training examples. And neural networks may be better than SVMs too, but take longer to train. ¯\_(ツ)_/¯

Homework

The homework this week was super-easy, all about applying SVM to problems. Each of the 4 exercises was like 3 lines of code. The only hard part was learning Octave syntax for doing things like “find the element in the list”, which once again made me wish we were using Python.

Anyway we implemented two learning exercises. An artificial problem of “find the boundary between Xs and Os for 2d input points”. And a real problem of “build a spam classifier trained on this data set”. We were supplied with the SVM ML system itself, so all we had to do was write simple functions to compute the Gaussian kernel and boil a list of stemmed words down into a vector coding. It was kind of dumb.

The most useful assignment was writing a loop to try out the training system with various values of C and sigma, the tuning parameters for the SVM, and experimentally determining which values gave the most accurate trained model. I imagine this is the kind of thing you do in the real world frequently, and doing it well is an art.
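
That loop is common enough that scikit-learn builds it in as GridSearchCV; here’s a sketch of roughly the same search we did by hand, with grid values that are just a reasonable starting spread rather than anything from the course.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Same idea as the homework loop over C and sigma, but scoring each
# combination with cross-validation on the training set.
param_grid = {
    "C":     [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30],
    "gamma": [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30],   # gamma standing in for sigma
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```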

The spam filter problem was also fun because it felt real. It took inputs from the SpamAssassin spam corpus. We used their code to crunch that text down to stemmed words, which we then coded into feature vectors. Push it through the SVM and you end up with a system that classifies 98.6% of spam correctly! Which is not so great; I think 99.99% accuracy is the minimum for a useful spam system, and even then you really want to measure the false positive rate carefully. But I got a glimpse of how to apply this stuff to a real problem that actually is solved with ML systems like the ones we’re studying, and that was fun.

Application

Once again I find myself wanting to do this in Python. I really want a class which is “applied Machine Learning in Python”. I guess focussing on how to use Pandas and scikit-learn. Maybe someone has written that up already?

Machine learning: picking the right system

Week 6 of my machine learning course.

Lectures: applying machine learning, system design

This week’s lecture was about using machine learning algorithms well. Great series of notes, the kind of thing I’m really in this course for. The focus in particular was on figuring out how to improve the results of your learning system and then evaluate the result. You have a lot of options to tweak the model and the learning: more properties, more data, different learning rates. Which should you use? He gave a few heuristics that are better than guessing.

The first thing is splitting data into training and test sets. Keep the evaluation test data separate so you have something to test with that the learning algorithm never saw. Good advice. He also points out that you often use tests on trained systems to pick which trained system to use; you can’t use that data (the “cross-validation data”) for final evaluation either, since your system was picked precisely because it does well on that data.

The other thing was defining “high bias” models (underfit models, too few parameters) and “high variance” models (overfit models, too many parameters). These are important concepts! Then he gives a test to figure out if your learning system is one or the other, which is basically looking at error rates. Systems that train to low error but then test badly are high variance, overfit. Systems that train to high error and test badly are high bias, underfit. I get the intuition for this and look forward to applying it in the homework.

Finally we moved on to ways to build systems and evaluate them for success. Really nice exhortation to build a quick and dirty system first with the simplest features and data you can get your hands on. Start working with the problem and you’ll understand it better; then you can make an informed decision about what extra data you need. Also an interesting observation about the usefulness of very large datasets. Having an enormous amount of training data means you can use more complex models with less bias, and be confident that you won’t overfit the training data, just because of the sheer size of that training set.

Ng defined precision and recall for measuring an algorithm’s correctness, also boiling those two numbers down into a single F-score for ease of comparison. Useful in particular for skewed datasets where you may be classifying something that’s only true 1 in 1000 times.

Homework

This week’s homework has us loop back and implement the very first model we learned, linear regression, then test and evaluate it according to some of what we learned. I didn’t care much for the linear regression part, but the hands-on application of learning systems and error measurements was great.

The first half of the homework was “implement linear regression learning with all the bells and whistles”. We’d not done that before, only logistic regression, so it was similar and yet different. Not too challenging really, a bit repetitive from previous work, but reinforcement is good.

The second half of the homework was “train a bunch of learning models and measure their error”. Specifically comparing the error on the training set vs. the training on our reserved cross-validation set, and graphing learning curves, etc. This is a sort of meta-machine learning, choosing the parameters for our machine learning system so that we get a well trained model.

We looked first at a truly linear model, trying to fit a straight line to data that’s roughly quadratic. Lots of errors. The learning curve shows us that adding more data points isn’t really going to help make this model more accurate; it has too much bias. So then we switched to an eighth order polynomial model but trained it without regularization. That fits the training data great but fails cross-validation, because it has high variance.

So finally we applied regularization in the learning model, the mechanism that encourages the learning toward low-magnitude coefficients for the polynomial terms. The strength of that encouragement is governed by the value lambda. So train lots of models with low lambda and high lambda and see which lambda is best. Lambda = 3, that’s best! At least for this problem.
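
The same experiment is easy to reproduce with scikit-learn’s validation_curve, with Ridge regression plus polynomial features standing in for the homework’s regularized linear regression (alpha playing the role of lambda); the data here is synthetic and roughly quadratic, like the homework’s.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic, roughly quadratic data standing in for the homework dataset.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.5, size=60)

model = make_pipeline(PolynomialFeatures(degree=8), StandardScaler(), Ridge())

# Sweep the regularization strength and compare training vs validation scores.
lambdas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
train_scores, val_scores = validation_curve(
    model, X, y, param_name="ridge__alpha", param_range=lambdas, cv=5)

for lam, tr, va in zip(lambdas, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"lambda={lam}: train R^2={tr:.2f}, validation R^2={va:.2f}")
```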

Conclusion

I’m really itching to apply this stuff to my own problems now, to make it real. But I’ve got to solve two problems at once to do that. I need confidence in my own code without the safety net of homework problems with calculated correct answers. And I really want to redo this work in Python, to get out of the weird Octave world. Taking both leaps at once is foolish. I should either bring a problem of my own into Octave once to apply my proven code to new data. Or else port the homework to Python first and verify I still get the right answers to prove my new code. I tried the latter already and got frustrated, should try the former.