Machine Learning: anomaly detection, recommender systems

Week 9 of 11, and the last week with coding homework!

Anomaly Detection

The first unit of lectures was on anomaly detection: picking outliers out of a dataset. Very simple compared to previous algorithms we’ve studied. In the one feature case it boils down to “calculate the mean and standard deviation of the input set. Values way outside the mean are anomalies”. He does define it carefully in terms of Gaussian probability distributions, which is the solid way to do this at least.

So what about multi-featured data? Instead of doing proper multivariate Gaussians, the course focusses on handling vector input data by analyzing a single-variable Gaussian separately for each input feature, then calculating the product of all the probabilities to label the full vector. The result is a special case of the general multivariate Gaussian, in particular it’s a case that only really makes sense if you believe there’s no correlations between your input features. If you do it this limited way you’re stuck having to inspect the data to define new synthetic features, like “height times width” based on domain knowledge that the area of something is more relevant than the height or the width itself.

There are optional lectures on multivariate Gaussians but I’m left wondering why the course just doesn’t start there. Maybe to simplify the math (ha!) or maybe because the special case is more efficient to calculate. He notes the multivariate case doesn’t work well if you have more features that data samples. But if that’s the case, you probably shouldn’t be trying some fancy statistics on your meagre dataset anyway. (The irony is the homework code actually does use the multivariate Gaussian afterall.)

The artful step of applying anomaly detection is determining the threshold for what an anomaly is. Ie: if the datapoint has probability of 0.9 it’s clearly good. 0.1? Probably. But 0.01? Or 0.001? Or 0.0001? How improbable does it have to be before you decide it’s an anomaly? The course has us iteratively learning a good epsilon threshold for discrimination, using a classified test set. But we have the standard deviation of the good data available to us; surely it’d be better to talk about “5 standard deviations out” being a threshold? Or maybe that’s implicit in the Gaussian probability calculation?

The programming homework was awfully simple. The solution for the first part of the first question is literally “mu = mean(X);”. I have come to appreciate how nice Octave is about writing vectorized code for computing with vector and matrix data. So much nicer than writing loops!

Recommender systems

Second unit was on recommender systems, specifically collaborative filtering. This is serious throwback for me; when I started grad school in 1996 my group was all excited because some previous students had just gone off to found Firefly, a company to do collaborative filtering recommendations. They did sort of OK and finally sold to Microsoft 6 years later for a rumored $40M. IIRC the machine learning part wasn’t the real value of the company in the end, but rather some engineering details of how they built it.

Anyway, collaborative filtering is a neat algorithm because it lets a machine learning system really learn something. You give it a big batch of user ratings of, say, movies. The system divines hidden features for those movies that best predict why people have preferences for certain movies. (Ie: maybe it figures out there’s a grouping which is roughly “action movies”). Then you can use these learned features to go back and predict ratings you haven’t seen before.

I particularly like the way Ng presented the algorithm first as an iterative problem. “Use ratings to predict features. Then use those features to predict ratings. Then use those to improve your feature predictions, …”. But in the end it all gets collapsed into a single optimization problem.

OTOH I still hate the math and notation presented in the lectures, it’s just impenetrable. I understand the gist of what it’s doing but what is this supposed to convey? My desire now is that the course not be taught with math notation but instead Octave code notation, so that all those ugly sums and stuff are replaced with simple vector math code samples.


One thing the lecture didn’t take is how you decide how many features to let the system learn. I believe this choice is arbitrary but probably essential to good performance. Someone else already asked in the course discussion. The mentor answered saying basically “try the smallest number that works well on test data”.

I was dreading the homework, since it’s yet again a “translate the confusing math to code” exercise. But having done this kind of thing a few times and with the mentor’s course notes as a guide, it’s pretty straightforward. Particularly to just do a vectorized implementation instead of the for loops the lecture definitions and exercises suggest. Seriously, I think this course got its approach wrong, it’s insane to try to do the calculations unvectorized in a language like Matlab. Both for clarity and performance. The lectures should teach to this approach.

The exercises build up to an actual movie recommender system (using MovieLens data). Neat! Only the homework is flubbed. With the originally supplied code we tell the recommender we liked Alien and Die Hard 2 and then it recommends Star Wars and Braveheart. Makes sense, right? Only the predicted rating I will give these films is 8.9 on a scale of 1–5. Oops. But at least the list of recommended movies is plausible.

Turns out the supplied code forgot to use some normalized values. So I fix that code as instructed and run it again, and now the highest predicted rating is 5. Good! Only now the list of recommended films are completely obscure movies like “Santa with Muscles” and “Aiqing wansui”. I’m left wondering if there’s still a bug in the code or this is a cruel lesson in the confusing vagaries of trained machine learning systems. I looked more closely and the recommendations I expect are there at the top, just not the very top. Still, it’s a disappointing end to the homework problem.

No more programming!

And a disappointing end to the course’s programming assignments. That’s the last one. There’s two more weeks of lectures and short quizzes, but no more coding. I’m sort of glad because hey! no homework! But also sad because I was hoping I’d learn more from the coding assignments.

I’ll say it one last time; I’d love a class that focussed more on application of machine learning systems. I really don’t need to implement the code for neural network back propagation or the fiddly details of collaborative learning. Or rather if I’m going to be doing that, I should be doing it on a graduate level with real math as the foundation. I’m fine with an introductory course being more shallow, but then I wish the programming exercises skipped the half-assed algorithm implementation and focussed on applications, on coaxing good results from these algorithms.

I realize I’m being cranky and truly I’m grateful that the course exists. I certainly appreciate that it’s free! Mostly my criticism is intended as “how could this be better?”.