Machine Learning: k-means clustering, principal component analysis

This week’s lectures were our first foray into what is bizarrely called “unsupervised machine learning”. All our previous algorithms were given both data features and correct classifications as inputs. This time around all we get are data features; these are algorithms that try to classify / understand the data without guidance. It all seemed much simpler than previous weeks, particularly the assignments; I don’t know if I’m getting smarter or the material is just easier. I vaguely remember enough math from college that this all seemed pretty straightforward.

K-Means Clustering

K-means clustering is a way to classify a bunch of inputs into a relatively small set of “K” clusters. It’s easy to illustrate visually:

[Figure: scatter plot of two clusters with their centroids marked (Screen Shot 2015-09-04 at 3.29.05 PM)]

The dots represent input data points, 2 dimensions of input data. Visually we can see these naturally fall into two clusters, helpfully colored here red and blue. Note the input data to this algorithm is not labelled; we don’t know what’s red and blue. The point of k-means clustering is to discover those red/blue labels. Once k-means has converged the model has two points, marked here with Xs. Those are the centroids of the two clusters. The way the algorithm works is you pick points at random to be your initial cluster centroids, then iterate: assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it, so as to minimize the distance between every point in a cluster and its centroid.
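Here’s a minimal sketch of that loop in Python/NumPy; the function name, random initialization, and fixed iteration cap are my own choices for illustration, not the course’s Octave starter code:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k clusters; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label every point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels
```

In practice you’d run this several times from different random initializations and keep the lowest-cost result, since the answer depends on where the initial centroids land.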

Intuitively this is a pretty simple algorithm, and it is effective. Where people get tripped up is that our intuitions about “distance” and “near” from 2 or 3 dimensional spaces don’t really work for the 1000+ dimensional data we often study. Ng didn’t go over that; it’s just wisdom I’ve picked up from friends.

One thing I hadn’t known is that k-means is useful even when the data isn’t strictly clustered. For instance, say you have the heights and weights of a bunch of people, a continuous smear of points. Now say you want to sell t-shirts in three sizes: Small, Medium, Large. Cluster your data into three clusters and each centroid is the mean person you should design that size for. Kinda neat.
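A quick sketch of that with scikit-learn; the height/weight numbers here are made up just to have something to cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (height cm, weight kg) measurements; real data would be a big smear.
rng = np.random.default_rng(1)
people = np.column_stack([rng.normal(170, 10, 500), rng.normal(70, 12, 500)])

# Three clusters ~ Small, Medium, Large; each centroid is the "mean person"
# you'd design that size for.
sizes = KMeans(n_clusters=3, n_init=10, random_state=0).fit(people)
print(sizes.cluster_centers_)
```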

The art of applying k-means is knowing how to pick K, the number of clusters. He presented the “elbow method”, which basically means you keep making K bigger until you stop getting a significant improvement in the clustering error.
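A sketch of the elbow method, assuming scikit-learn: fit k-means for a range of K values, record the cost (sum of squared distances to the nearest centroid, sklearn’s `inertia_`), and eyeball where the curve flattens out.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))  # stand-in data

# Cost versus K; the "elbow" is where adding clusters stops helping much.
for k in range(1, 9):
    cost = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(cost, 1))
```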

Principal Component Analysis

PCA is an algorithm to project a high dimensional dataset down to a lower dimensional one. You can use this to project 10,000 dimensional input data down to, say, 1000 dimensions so your learning algorithms run faster because there’s less data. Or you can project data down to 2 or 3 dimensions for visualization.

The actual PCA algorithm basically boils down to “calculate eigenvectors”, which Ng sort of acknowledged while not assuming the student knows what an eigenvector is. It was a pretty clear explanation, really.
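A minimal sketch of PCA in NumPy along those lines, using an eigendecomposition of the covariance matrix (the course assignment computes an SVD of the same matrix, which yields the same components):

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the top principal components; returns (projected, eigenvalues)."""
    # Center the data (Ng also recommends feature scaling first).
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features.
    cov = (Xc.T @ Xc) / len(Xc)
    # Eigenvectors of the symmetric covariance matrix, sorted by descending eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, eigvals[order]
```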

You still have to choose how many dimensions you want in your reduced dataset. His heuristic is “you want 99% of variance retained”, which means “accept up to 1% error”. He says in practice you can often reduce data to 1/5 to 1/10 the number of dimensions and stay within that threshold.
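Given the eigenvalues from a decomposition like the sketch above, picking the number of dimensions for 99% variance retained is just a cumulative sum:

```python
import numpy as np

def choose_k(eigvals, retain=0.99):
    """Smallest number of components whose share of total variance reaches `retain`."""
    # Sort eigenvalues descending; cumulative fraction of variance retained.
    ratios = np.cumsum(np.sort(eigvals)[::-1]) / np.sum(eigvals)
    k = int(np.searchsorted(ratios, retain)) + 1
    return min(k, len(eigvals))
```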

Homework

Our actual homework this week was really simple, basically writing one or two lines of vector math code at a time to implement the pieces of the k-means and PCA algorithms. Not much learning. There were a bunch of optional extra assignments applying these techniques to real world data: using k-means clustering to reduce an image palette, for instance (find the 16 most important colors), or using PCA to reduce pixel data from pictures to the important components for facial recognition. That was neat stuff; I wish we’d done that for homework.
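The palette exercise is roughly this, sketched with scikit-learn and Pillow rather than the course’s Octave code; “photo.png” is a stand-in filename:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Treat every pixel as a point in RGB space, cluster into 16 groups,
# then repaint each pixel with its cluster's centroid color.
img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float) / 255.0
pixels = img.reshape(-1, 3)
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].reshape(img.shape)
Image.fromarray((quantized * 255).astype(np.uint8)).save("photo_16colors.png")
```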

[Figure: Screen Shot 2015-09-04 at 3.36.44 PM]

Once again I find myself wanting to do this stuff in Python. PCA would be nice to apply to that Battlefield 4 dataset I collected; it’s ~20 dimensional data, so boil it down to the 4 or 5 eigenvectors that really define people’s performance. Then maybe cluster that to segment the population.
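Something like this is what I have in mind, sketched with scikit-learn; the `stats` array here is just random stand-in data, not the real Battlefield 4 numbers, and the component and cluster counts are guesses:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical (n_players, 20) array of per-player statistics.
stats = np.random.default_rng(3).normal(size=(1000, 20))

# Scale the features, boil ~20 dimensions down to the few components that
# explain most of the variance, then cluster in that reduced space.
reduced = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(stats))
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
```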