Machine Learning: k-means clustering, principal component analysis

This week’s lectures were our first foray into what is bizarrely called “unsupervised machine learning”. All our previous algorithms were given data features and correct classifications as inputs. This time around all we get are data features; these algorithms try to classify / understand the data without guidance. It all seemed much simpler than previous weeks, particularly the assignments. I don’t know if I’m getting smarter or the material is just easier. I vaguely remember enough math from college that this all seemed pretty straightforward.

K-Means Clustering

K-means clustering is a way to classify a bunch of inputs into a relatively small set of “K” clusters. It’s easy to illustrate visually:

Screen Shot 2015-09-04 at 3.29.05 PM

The dots represent data input points, with 2 dimensions of input data. Visually we can see these naturally fall into two clusters, helpfully colored here red and blue. Note the input data to this algorithm is not labelled; we don’t know what’s red and blue. The point of k-means clustering is to discover those red/blue labels. Once k-means has converged the model has two points, here marked with Xs. Those are the centroids of the two clusters. The way the algorithm works is you pick points at random to be your cluster centroids, then iterate: assign every point to its nearest centroid, move each centroid to the mean of the points assigned to it, and repeat. Each pass shrinks (or at least never grows) the total squared distance between the points in a cluster and its centroid.
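Here is a minimal sketch of that assign-then-update loop in plain Python, not the Octave code from the course. The function names, the sample points, the fixed iteration count and the optional `initial_centroids` argument are all assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=20, seed=0, initial_centroids=None):
    """Return (centroids, assignments) after a fixed number of passes."""
    rng = random.Random(seed)
    if initial_centroids is None:
        # Pick k of the input points at random as starting centroids.
        centroids = rng.sample(points, k)
    else:
        centroids = list(initial_centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignments

# Made-up 2D data with two obvious clumps.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = k_means(points, k=2)
```

Because the starting centroids are random, different seeds can land in different local optima; a common way around that is to run it several times and keep the run with the lowest total distance.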

Intuitively this is a pretty simple algorithm, and it is effective. Where people get tripped up is that our intuitions about “distance” and “near” from 2 or 3 dimensional spaces don’t really work for the 1000+ dimensional data that we often study. Ng didn’t go over that, it’s just wisdom I’ve picked up from friends.

One thing I hadn’t known is that k-means is useful even when the data isn’t strictly clustered. For instance say you have height/weight of people, a continuous smear of points. Now say you want to sell t-shirts in three sizes: Small, Medium, Large. So cluster your data into three clusters and the centroids are the mean person you should make that size for. Kinda neat.

The art of applying k-means is knowing how to pick K, how many clusters. He presented the “elbow method” which basically lets you keep making K bigger until you don’t get a significant improvement in the error from its application.

Principal Component Analysis

PCA is an algorithm to project a high dimensional dataset down to a lower dimensional dataset. You can use this to project 10,000 dimensional input data down to, say, 1000 dimensions so that then your learning algorithms run faster because there’s less data. Or you can try to project data down to 2 or 3 dimensions for visualization.

The actual PCA algorithm basically boils down to “calculate eigenvectors”. Which Ng sort of said was the case, while not assuming the student knows what an eigenvector is. It was a pretty clear explanation really.

You still have to choose how many dimensions you want in your reduced dataset. His heuristic is “You want 99 percent of variance retained”, which means “accept a projection error of up to 1% of the data’s total variance”. He says in practice you can often reduce data to 1/5 to 1/10 the number of dimensions and stay in this threshold.
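As a rough sketch of that heuristic: if you already have the variance captured by each principal component (the eigenvalues of the covariance matrix), picking the number of dimensions is a running sum. The function name and the eigenvalues below are invented for illustration; they do not come from the course data.

```python
def components_needed(variances, retain=0.99):
    """Smallest number of components whose variance share reaches `retain`."""
    total = sum(variances)
    running = 0.0
    # Largest components first, since PCA keeps the top ones.
    for k, variance in enumerate(sorted(variances, reverse=True), start=1):
        running += variance
        if running / total >= retain:
            return k
    return len(variances)

# Hypothetical per-component variances for a 6-dimensional dataset.
eigenvalues = [50.0, 30.0, 15.0, 4.5, 0.3, 0.2]
k = components_needed(eigenvalues)
```

With these made-up numbers the first three components hold 95% of the variance and the first four hold 99.5%, so four dimensions are enough to clear the 99% bar.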


Our actual homework this week was really simple, basically writing one or two lines of vector math code at a time to implement the pieces of the k-means and PCA algorithms. Not much learning. There were a bunch of optional extra homework assignments applying these to real world data. Using k-means clustering to reduce an image palette for instance (find the 16 most important colors). Or using PCA to reduce pixel data from pictures to the important components for facial recognition. That was neat stuff, I wish we’d done that for homework.

Screen Shot 2015-09-04 at 3.36.44 PM

Once again I find myself wanting to do this stuff in Python. PCA would be nice to apply to that Battlefield 4 dataset I collected; it’s ~20 dimensional data, boil it down to the 4 or 5 eigenvectors that really define people’s performance. Then maybe cluster that to segment the population.

Machine Learning: Support Vector Machines

After weeks of complaining my machine learning course had too much detail on the math and guts of ML algorithms and not enough on applying ML to real data, I got my wish. This week’s course on Support Vector Machines (SVMs) was pretty high level and breezy, and the homework even more so. And now I’m dissatisfied, feel like I didn’t get my week’s learning’s worth! Not that I’m complaining to have an easy week, but I wish there were a bit more brain-bending around application to replace all brain bending I was doing before figuring out the vector math and programming.

PS: I’ve created a Machine Learning category on the blog. And mucked with the theme. Boy, is super buggy and badly product managed.


The main topic this week was Support Vector Machines, a supervised learning technique that was framed as being more powerful and practical in application than the previous techniques we used like linear regression, logistic regression, and neural networks.

Conceptually it works a lot like these other supervised learning systems. You define a cost function, but unlike the logistic cost function this one biases the system to prefer “large margins”. Ie basically it’s not enough to say “49% chance it’s category A”, the training is encouraged to say “10% chance it’s category A” instead. You control this bias/overfit tradeoff with the parameters C and sigma. C is much like our lambda from before, a damping factor to encourage small constants. (Confusingly, C = 1/lambda). sigma shapes the cost function curve itself, at least in the usual Gaussian kernel.

Oh yes, kernels. This is a neat trick where you can modify your input features with various functions. A linear kernel gives you a plain linear classifier, much like logistic regression (not exactly, but more or less). The Gaussian kernel is a good general purpose kernel. There’s a bunch of other kernels people use. And some magic allows this pluggable kernel function to be applied efficiently, so the training system runs quickly.
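For a sense of what a kernel actually computes, here is a small Python sketch of the Gaussian kernel as a similarity score between two feature vectors. The function name is my own; sigma is the same tuning parameter discussed above.

```python
import math

def gaussian_kernel(x1, x2, sigma):
    """Similarity in (0, 1]: 1.0 for identical points, falling toward 0 with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / (2 * sigma ** 2))

same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma=1.0)      # identical points
narrow = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=1.0)    # small sigma: falls off fast
wide = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=10.0)     # large sigma: falls off slowly
```

A small sigma makes the similarity drop off sharply, so the model can bend around individual points (more variance). A large sigma smooths things out (more bias).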

That’s where the lectures got a bit hazy. There was one whole 20 minute video dedicated to the math which was prefaced with “this is optional, and you should never try to implement this; use libsvm instead”. The main thing I learned from this is you can still understand speech when watching a video played back at 2.2x speed, even though the individual phonemes are no longer distinguishable. He skimmed over the math quite a bit, nothing was lost by ignoring it.

I never did quite understand why SVMs are better. We were advised they are best in applications with lots of training examples relative to the size of the feature set. Logistic regression is better if you have lots of features and few training examples. And neural networks may be better than SVMs too, but take longer to train. ¯\_(ツ)_/¯


The homework this week was super-easy, all about applying SVM to problems. Each of the 4 exercises was like 3 lines of code. The only hard part was learning Octave syntax for doing things like “find the element in the list”, which once again made me wish we were using Python.

Anyway we implemented two learning exercises. An artificial problem of “find the boundary between Xs and Os for 2d input points”. And a real problem of “build a spam classifier trained on this data set”. We were supplied with the SVM ML system itself, so all we had to do was write simple functions to compute the Gaussian kernel and boil a list of stemmed words down into a vector coding. It was kind of dumb.

The most useful assignment was writing a loop to try out a training system with various values of C and sigma, the tuning parameters for the SVM. And experimentally determine what values gave the most accurate trained model. I imagine this is the kind of thing you do in the real world frequently, and doing it well is an art.
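The loop itself is simple enough to sketch in Python. Everything here is a stand-in: `validation_error` is a hypothetical callback that would train an SVM with the given C and sigma and return its error on the cross-validation set, and `toy_validation_error` is a made-up function that only exists so the example runs without an SVM library. The candidate values are illustrative too.

```python
import itertools
import math

def grid_search(candidates_c, candidates_sigma, validation_error):
    """Try every (C, sigma) pair and return (error, C, sigma) for the best one."""
    best = None
    for c, sigma in itertools.product(candidates_c, candidates_sigma):
        error = validation_error(c, sigma)
        if best is None or error < best[0]:
            best = (error, c, sigma)
    return best

def toy_validation_error(c, sigma):
    """Stand-in for 'train and score on held-out data'; minimal at C=1, sigma=0.1."""
    return math.log10(c) ** 2 + (math.log10(sigma) + 1) ** 2

values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
best_error, best_c, best_sigma = grid_search(values, values, toy_validation_error)
```

The important detail is that the score comes from data the model was not trained on. Picking C and sigma by training error would just reward overfitting.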

The spam filter problem was also fun because it felt real. Took inputs from the SpamAssassin spam corpus. Used their code to crunch that text down to stemmed words, which we then coded into feature vectors. Push it through SVM and you end up with a system that classifies 98.6% of spam correctly! Which is not so great, I think 99.99% accuracy is minimal for a useful spam system. And even then you really want to measure that false positive rate carefully. But I had a whisper of an idea of how to apply this stuff to a real problem that is really solved with ML systems like what we are studying, and that was fun.


Once again I find myself wanting to do this in Python. I really want a class which is “applied Machine Learning in Python”. I guess focussing on how to use Pandas and scikit-learn. Maybe someone has written that up already?

Machine learning: picking the right system

Week 6 of my machine learning course.

Lectures: applying machine learning, system design

This week’s lecture was about using machine learning algorithms well. Great series of notes, the kind of thing I’m really in this course for. Focus in particular was on figuring out how to improve the results of your learning system and then evaluate the result. You have a lot of options to tweak the model and learning: more properties, more data, different learning rates. Which should you use? He gave a few heuristics better than guessing.

The first thing is splitting data into training and test sets. Keep the evaluation test data separate so you have something to test with that the learning algorithm never saw. Good advice. He also points out that you often use tests on trained systems to pick your trained system; you can’t use that test data (“cross-validation data”) for final evaluation either, since your metalearning system was picked to do well on exactly that data.

The other thing was defining “high bias” models (underfit models, too few parameters) and “high variance” models (overfit models, too many parameters). These are important concepts! Then he gives a test to figure out if your learning system is one or the other, which is basically looking at error rates. Systems that train to low error but then test badly are high variance, overfit. Systems that train to high error and test badly are high bias, underfit. I get the intuition for this and look forward to applying it in the homework.

Finally we moved on to ways to build systems and evaluate them for success. Really nice exhortation to build a quick and dirty system first with the simplest features and data you can get your hands on. Start working with the problem and you’ll understand it better, then you can make an informed decision about what extra data you need. Also an interesting observation about the usefulness of very large datasets. Having an enormous amount of training data means you can use more complex models with less bias, and be confident that you won’t overfit your training data just because of the sheer size of that training data set.

Ng defined precision and recall for measuring an algorithm’s correctness, also boiling those two numbers down into a single F-score for ease of comparison. Useful in particular for skewed datasets where you may be classifying something that’s only True 1 in 1000 times.
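A small Python sketch of those three numbers, using the usual definitions (precision = true positives / predicted positives, recall = true positives / actual positives, F-score = their harmonic mean). The function name and the sample labels are made up.

```python
def precision_recall_f1(actual, predicted):
    """Return (precision, recall, F1) for two equal-length lists of booleans."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a and p)
    false_pos = sum(1 for a, p in pairs if not a and p)
    false_neg = sum(1 for a, p in pairs if a and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Skewed data: one positive in 1000. Always answering False is 99.9% "accurate".
actual = [True] + [False] * 999
always_false = [False] * 1000
lazy_scores = precision_recall_f1(actual, always_false)

# A small balanced-ish example: 2 hits, 1 false alarm, 1 miss.
scores = precision_recall_f1(
    [True, True, True, False, False, False, False, False],
    [True, True, False, True, False, False, False, False],
)
```

The always-False classifier scores zero on recall and F-score despite its high accuracy, which is exactly why accuracy alone is misleading on skewed data.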


This week’s homework has us loop back and implement the very first model we learned, linear regression, then test and evaluate it according to some of what we learned. I didn’t care much for the linear regression part, but the hands-on application of learning systems and error measurements was great.

The first half of the homework was “implement linear regression learning with all the bells and whistles”. We’d not done that before, only logistic regression, so it was similar and yet different. Not too challenging really, a bit repetitive from previous work, but reinforcement is good.

The second half of the homework was “train a bunch of learning models and measure their error”. Specifically comparing the error on the training set vs. the error on our reserved cross-validation set, and graphing learning curves, etc. This is a sort of meta-machine learning, choosing the parameters for our machine learning system so that we get a well trained model.

We looked first at a truly linear model, trying to fit a straight line to data that’s roughly quadratic. Lots of errors. The learning curve shows us that adding more data points isn’t really going to help make this model more accurate, it is over-biased. So then we switched to an eighth order polynomial model but trained it without regularization. That fits the training data great but fails cross-validation, because it is high variance.

So finally we applied regularization in the learning model, the mechanism that pushes the learning toward low-magnitude coefficients for the polynomial terms. The strength of that push is governed by the value lambda. So train lots of models with low lambda and high lambda and see which lambda gives the lowest cross-validation error. Lambda = 3, that’s best! At least for this problem.


I’m really itching to apply this stuff to my own problems now, to make it real. But I’ve got to solve two problems at once to do that. I need confidence in my own code without the safety net of homework problems with calculated correct answers. And I really want to redo this work in Python, to get out of the weird Octave world. Taking both leaps at once is foolish. I should either bring a problem of my own into Octave once to apply my proven code to new data. Or else port the homework to Python first and verify I still get the right answers to prove my new code. I tried the latter already and got frustrated, should try the former.

Machine learning: backpropagation shitshow

Frustrating week in the machine learning course. We learned how to implement backpropagation in neural networks. Or at least we pretended to.


Ng’s lectures skate over the math pretty fast and there’s like 3 separate snippets reassuring “if you don’t understand this don’t worry, it’s hard and I don’t always remember either lol!”. I think this course tried to walk the line between either really doing the math or not and fell in a gap in the middle. I’d prefer to have skipped the math entirely. I kind of did, honestly, and fell back on my “how do I fake this” survival skills from the few college math classes I didn’t like.

The useful part is Ng also emphasizes the intuitions and mechanical logic of what backpropagation is doing, and that I think I understand. Forward propagation is creating a mesh of hidden layers that map inputs to outputs with hidden weights. Backprop starts at the output layer and observes the error, how far the network’s output is off from expected output. It then tries to apportion that error to the previous layer’s nodes by spreading the error backwards, weighted by the activation parameters. And so error is propagated backwards through each layer and in the end you have some idea how much each activation weight contributed to the mistakes the output made.

More concretely, backpropagation gives you a cost function J and its partial derivatives where each activation weight in the neural network gets a fair share of blame for the error. With cost function in hand you can run a function optimizer to find the activation weights that minimize the cost. (One fact Ng skated over; this cost function is no longer convex, which means the optimizer might get stuck in local minima. He mentions the problem and says not to worry about it in practice. Um, ok).

One fun thing in optimizing neural networks: if you start one out with nodes having constant activation weights (say, all zero) then all the nodes look the same and error is apportioned equally, so nodes don’t train separately. To break this degenerate symmetry you start training the network with randomized weights. Reminds me of the Big Bang and the lumpy cosmic microwave background.


The homework programming exercises are the only way I feel like I’m really learning anything. They’re reasonably good.

However, this week’s lecture + assignment clearly specify “don’t try to vectorize over the 5000 training inputs, just use a loop”. But then the tutorial that’s my real lifeline to implementing these assignments is a vectorized implementation! Which is it, teacher? I’ve mostly preferred implementing the vectorized versions, both because it’s faster to run and because it’s a closer match to the math equations from the lectures. It’s sure nice avoiding writing lots of for loops. But the teaching materials really need to decide to either vectorize everything or not, this stuff in conflict is a mess that I imagine loses a lot of students.

While I’ve been doing vectorized implementations, really I’d prefer to not vectorize over the training inputs. Write all my code to process one training example at a time. That’s the way a lot of stuff works in the real world anyway.

Anyway I stumbled through the notes and managed to implement backpropagation. I think I understand how it works too, although I just accept the provided cost function and derivatives as gospel truth rather than having done the derivation myself. And in the end I’ve trained my own neural network to recognize handwritten digits. Yay!

One fun thing in the homework is they visualize the hidden activation layer. That layer has 25 nodes. This image shows for each of those nodes, how much it weights the value of any particular pixel. If you squint you can convince yourself this hidden layer is doing some feature detection, like the bright vertical line in row 2, column 3. Also clearly the corners of the image are not useful to anything.

Screen Shot 2015-08-16 at 5.59.57 PM

I thought the technique of gradient checking was interesting. Backpropagation (and other cost functions) require you analytically divine the partial derivatives of your cost function, or in my case copy them blindly from the lecture notes, and then implement this in code. It’s pretty easy to have a bug. And apparently subtle errors in the gradient function result in neural networks that sort of converge but just not very well, so you might not notice the bug. Ng suggests as a backup check that you also numerically approximate the derivatives, by evaluating the cost function an epsilon away from some value and thereby calculate the gradient. Then check this numeric approximation is close to your analytically derived implementation. I like the idea, it’s a sort of interactive debugging technique. Also begs the question why bother with the analytic derivative at all, why not just use a numerical approximation? Because it’s expensive to calculate the numeric approximation, that’s why. And also it’s less accurate.
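Gradient checking is easy to sketch in Python. This version uses a two-sided difference (evaluate the cost at plus and minus epsilon), which I am assuming is the variant meant here. The cost function is a toy polynomial chosen so the analytic gradient is obvious, not a neural network cost.

```python
def numerical_gradient(cost, theta, epsilon=1e-4):
    """Approximate each partial derivative of cost at theta by central differences."""
    gradient = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += epsilon
        minus[i] -= epsilon
        gradient.append((cost(plus) - cost(minus)) / (2 * epsilon))
    return gradient

def toy_cost(theta):
    """Toy cost: J = t0^2 + 3 * t0 * t1."""
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

def toy_analytic_gradient(theta):
    """Hand-derived partial derivatives of toy_cost."""
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]

theta = [1.0, 2.0]
approx = numerical_gradient(toy_cost, theta)
exact = toy_analytic_gradient(theta)
largest_gap = max(abs(a - e) for a, e in zip(approx, exact))
```

Note the cost is evaluated twice per parameter. For a network with thousands of weights that is thousands of full cost evaluations per gradient, which is why you run the check once to catch bugs and then turn it off.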

I also appreciate the role regularization plays. Apparently it’s not hard to overfit a neural network to input data. Ie: it is 100% accurate on your input data but then it’s overfit and may not work well on real data. The lambda regularization parameter is a damping applied to the gradients to encourage all the parameters to be small numbers, which apparently is the way you keep a neural network honest.

Here’s a picture of the hidden layer’s activations for an overfit model. Notice how there’s no obvious feature detection? At least, there’s no obvious vertical stroke detector like there was in the previous model. The accuracy of this model is 100%, but it may be less useful when applied to data it wasn’t trained on.

Screen Shot 2015-08-16 at 6.01.36 PM

That’s the kind of machine learning trap I’m taking this course to learn more about. I sort of understand the math behind why lambda works, but I can see it would take a lot of experience applying this stuff to real problems to really get a feel for how to set it right.

In the end I really hope to never, ever implement a neural network again in my life. I do honestly just want to use someone else’s neural network library. I do feel some sense of accomplishment having built my own though. And my real goal here is to understand a little of what’s going on inside the black box so I can use it better. I gather neural networks have a lot of subtle traps.

I still have to go and apply this stuff to real data I really care about. I started trying to do that a couple of weeks ago, in Python, but got frustrated trying to learn Pandas and scikit-learn and iPython notebooks all at once.

Machine learning: Neural networks introduction

Week four of my Coursera machine learning course was a breezy introduction to neural networks. The lecture videos were very high level but did a good job introducing the concept. The part I hadn’t understood before was how regression techniques are really best suited for linear prediction models, that building Nth order polynomials out of M features leads to roughly O(M^N) terms and badness. I also hadn’t really understood that neural networks are just a series of logistic regressions. The input variables are mapped through a logistic model to an intermediate hidden layer (of some chosen number of features), then the hidden layer is mapped again through a second logistic model to yield output variables. However the lecture stopped before we got to backpropagation, so for this week the method of training a neural network is still a mystery.

Logistic regression applied to OCR

The homework is a bit behind and out of sync with the lecture notes. The bulk of the work in the homework was still doing logistic regression, last week’s lecture concept. The hardest part was figuring out how to vectorize the naive loop implementation of the regularized logistic regression cost function we did last week. But I’d already vectorized it so I could just copy my solution from last week, gold star!

The more fun part was actually applying one of these learned models to do something useful with real data; OCR classification of handwritten numbers. The input was 5000 images, 20×20 greyscale pixel arrays, along with their classification (“this squiggle is the number 7”). Our job was to build a multiclass classifier to do the OCR, to predict a digit. So we took the regularized logistic regression cost function we just implemented and used fmincg() to search for the best parameters to match the data. The resulting output vector (theta) is our prediction model. Then we applied that learned model to classify input data. So I’ve now built a logistic regression OCR system for handwritten numbers! The final system predicted the input set with 95% accuracy. The final model is quite large; 4010 separate numbers. 401 weights for predicting each digit from 0–9, or one weight per pixel plus a constant term. Not exactly parsimony.

One neat thing about multiclass models is they don’t just output a predicted class (“the number 7”), they also output a vector of probabilities for each possible value: “probability this image is the number 1, probability it is the number 2, …”. We crush those probabilities down to a single “this input is probably an image of the number 7”. But something to remember for later; machine learning models not only can return a prediction, but a confidence in that prediction. Or some ambiguity, I believe the math works such that a single image might have a 90% probability of being the number 7 and an 80% probability of being the number 9 (for a particularly ambiguous squiggle.)

Neural network forward propagation

The last part of the homework was implementing a basic neural network. Or rather the application of one, the forward propagation that maps the input data through the layers and gives outputs. We were handed parameters that had already been trained, so really this was just an exercise in “can you code up forward propagation?” Useful to do that myself though. In particular I had to puzzle out that the hidden layer consists of 25 nodes. So the final classifier is basically two steps. Logistic regression to map 400 pixels to 25 hidden nodes. And then a second logistic regression to map 25 hidden nodes to 10 probabilities. The central mystery of neural networks is what those “hidden nodes” really mean. And we have Deep Dream to thank for a lovely visualized expression of hidden states in a different kind of machine learning image processing system.
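Those two steps can be sketched in a few lines of plain Python. This is a toy with 2 inputs, 2 hidden nodes and 2 outputs rather than the homework's 400, 25 and 10, and the weights are hand-picked for illustration, not trained. Each weight row starts with the bias term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """One logistic-regression step: prepend the bias input, then one sigmoid per row."""
    with_bias = [1.0] + list(inputs)
    return [
        sigmoid(sum(w * x for w, x in zip(row, with_bias)))
        for row in weights
    ]

def predict(inputs, theta1, theta2):
    """Forward propagate through both layers; return (best class index, output scores)."""
    hidden = layer(inputs, theta1)
    output = layer(hidden, theta2)
    best = max(range(len(output)), key=lambda i: output[i])
    return best, output

# Hand-picked illustrative weights: each row is [bias, weight1, weight2].
theta1 = [[-5.0, 10.0, 0.0], [-5.0, 0.0, 10.0]]  # inputs -> hidden
theta2 = [[0.0, 5.0, -5.0], [0.0, -5.0, 5.0]]    # hidden -> outputs

class_a, scores_a = predict([1.0, 0.0], theta1, theta2)
class_b, scores_b = predict([0.0, 1.0], theta1, theta2)
```

In a vectorized language each `layer` call collapses to a matrix multiply followed by an element-wise sigmoid.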


I almost gave up this week. Ran into a bunch of weird technical problems. By far the biggest one was me putting the line “print size(X)” in my code as a debugging aid and then forgetting about it. And suddenly my Octave program is complaining about fig2dev missing and I’m down a rabbit hole of Homebrew installs trying to figure out what the hell is wrong. Turns out “print” means print, as in paper, and I needed “printf”. Derp.

I am also doing a lot of stumbling around and shallow learning. Like I know I need to combine the matrix X with the vector theta somehow but forgot which way. Rather than puzzle that out from mathematical principles, I just inspect and see X is 5000×400 and theta is 1×400, and I need to multiply them somehow, so the only sensible math is X*theta’ or theta*X’. So try one out and see if the homework oracle tells me I got the right answer, and call it a day. (Writing this out I realized I picked the wrong one of the two, which is why I keep having to transpose everything. oops.) Anyway this doesn’t feel like learning so much as just bashing about with the only thing that will pass a type checker. I keep telling myself I’m absorbing something though, and my real goal here is just to understand enough about what’s going on that I can use other people’s machine learning systems later.

I do wish the homework assignments had more interactive help. In particular a bit more hand-holding about how to run the code you write, understand what it’s doing, and test it against known good results. In desperation I started looking at the submit.m internals just to see what test cases the homework oracle was using. I don’t get to see the right outputs that way (they’re hidden on the server), but at least I have some reasonable test inputs I can look at.

The only actual course lecture notes are a wiki page. The wiki content is pretty good. But it requires a login to even read the pages! And Coursera’s wiki is broken somehow so you have to log in about once an hour, makes me very cranky.

Machine learning: Logistic regression

Just finished week 3 of Andrew Ng’s machine learning course on Coursera. I’m going to try to blog each week summarizing what I learned.

This week’s topic is logistic regression; predicting discrete outcomes like “success or failure” from numeric data inputs. Ie: “our diagnostics measure these 4 numbers for a tumor. Is it cancerous or benign?” Turns out logistic regression is basically just linear regression where the output set is restricted to the interval [0, 1]. Normal linear regression gives all real numbers on output, so you pass that through the sigmoid curve to bound the result to [0, 1]. That number is effectively “the probability the output is 1”; you can then threshold it at 0.5 to map it to a strictly binary “success or failure” output. There’s a bunch of math then to define the cost function and the partial derivatives of the cost function which you can then use with an optimization algorithm like gradient descent.

Logistic regression only classifies data into two classes. If you want N classes you do logistic prediction many times, ie: “is it A or not A? is it B or not B? is it C or not C?” and then pick the class ABC with the highest probability.
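A sketch of that one-vs-all trick in Python. The three classifiers and their weights are invented for illustration (each weight list starts with the constant term); in practice each one would come from its own "is it this class or not?" training run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_all_predict(features, classifiers):
    """Score the input with every binary classifier and pick the most confident."""
    x = [1.0] + list(features)  # constant term first
    probabilities = {
        label: sigmoid(sum(w * v for w, v in zip(weights, x)))
        for label, weights in classifiers.items()
    }
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities

# Hypothetical weights for three "X or not X" classifiers.
classifiers = {
    "A": [0.0, 2.0, -2.0],
    "B": [0.0, -2.0, 2.0],
    "C": [-3.0, 0.0, 0.0],
}
label, probabilities = one_vs_all_predict([1.0, 0.0], classifiers)
total_probability = sum(probabilities.values())
```

Each classifier answers independently, so the probabilities are not forced to add up to 1. Here they sum to a bit more than 1, and two classes can both score high for an ambiguous input.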

The predictor you get from the logistic regression is best understood in terms of a decision boundary, a drawing of the threshold that tips an input from success to failure. My final homework assignment was learning this threshold boundary to separate plusses from squares where the input data is 2 dimensional. The green line is the learned threshold boundary, some sixth order polynomial the magic optimizer found for me.

Screen Shot 2015-08-02 at 11.02.08 AM

Thankfully the course moved on a bit from math into “use the black box” by introducing fminunc, a black-box minimizer provided by Matlab/Octave. All it needs is the cost function and its partial derivatives and it does who-knows-what to find the minimum. Frankly I wish the class spent more time on how to use fminunc well and less time deriving the gradient descent solutions that I’ll never use again. But gradient descent is a nice simple optimizer and it is good to know how at least one works.

This week also introduced the concept of overfitting, of specializing your predictor too tightly to the data. Regularization was the solution provided for overfitting; basically biasing the cost function to prefer small values of theta, the learning model parameters we’re optimizing. Ng doesn’t really explain why small values of theta are “better” other than presenting some intuition that a theta_i of 0 means the parameter is fully ignored and a theta_i of 1000 means it’s way too overemphasized. The regularization parameter lambda is picked out of a hat much like the learning rate alpha is also picked for gradient descent.
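To make the "biasing the cost function" part concrete, here is a Python sketch of a regularized logistic regression cost. It follows the usual convention of not penalizing the constant term theta_0, which I am assuming matches the course. The data and function names are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Average log loss plus lam / (2m) times the sum of squared non-constant weights."""
    m = len(y)
    log_loss = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, row)))
        log_loss += -label * math.log(h) - (1 - label) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])  # skip theta_0
    return log_loss / m + penalty

# Each row is [1, x]: a constant term plus one feature.
X = [[1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0]]
y = [0, 0, 1, 1]

cost_at_zero = regularized_cost([0.0, 0.0], X, y, lam=1.0)
plain = regularized_cost([0.0, 2.0], X, y, lam=0.0)
penalized = regularized_cost([0.0, 2.0], X, y, lam=1.0)
```

The only difference between `plain` and `penalized` is the penalty term, so a larger lambda makes big theta values cost more and the optimizer backs away from them.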

Along the way I learned how to define anonymous functions in Octave which then gives you a really easy way to curry a partial function application (in this case, currying X and y into my costFunction so that fminunc only works on the parameter t).

fminunc(@(t)(costFunction(t, X, y)), initial_theta, options)
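For comparison, a rough Python analogue of the same currying trick, using either a lambda or `functools.partial`. The `cost_function` here is a stand-in (a squared-error cost for a linear model) and the data is made up; it is not the homework's costFunction.

```python
from functools import partial

def cost_function(theta, X, y):
    """Stand-in cost: halved mean squared error of a linear model."""
    m = len(y)
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(t * v for t, v in zip(theta, row))
        total += (prediction - target) ** 2
    return total / (2 * m)

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

# Same idea as @(t)(costFunction(t, X, y)): fix X and y, leave theta free.
cost_of_theta = lambda t: cost_function(t, X, y)
cost_of_theta_partial = partial(cost_function, X=X, y=y)

perfect_fit = cost_of_theta([0.0, 2.0])
from_zero = cost_of_theta_partial([0.0, 0.0])
```

Either one-argument callable is the shape a generic minimizer wants: it only ever sees the parameter vector.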

I continue to be amazed that I type vector equations into Octave and they Just Work on first try, despite my being rusty on linear algebra not to mention matrix programming languages. I’m kind of just accepting that if the homework grader says I got it right I’m done.

Some metacomments about the course… I read something somewhere that characterized this as an advanced undergraduate class, maybe sophomore or junior level. That feels about right and explains why it seems a little too easy. But easy in the right ways for me, I really don’t want to do the math derivations. Also I discovered a lot of previous students’ homework checked in on GitHub. Cheating would be stupid, who would I cheat but myself? But it’s nice to be able to look what other students did. Helped me verify arrayfun() was really the right thing to apply a scalar function to a matrix, for instance.

Now I feel like I’ve studied enough machine learning to apply it to a problem I care about. I intend to look at League of Legends match results, to see whether I can predict a game is a win or a loss based on match performance. The simplest thing is to look at end of game stats like kills or gold earned. Of course end of game I already know for sure if it was a win or a loss, but I’m thinking I can read the learned model parameters out to see how significant a contribution those inputs like kills are to whether a team wins. Alternately I can get ahold of some mid-game stats like “gold earned after 20 minutes”, whether that results in a win or a loss is a bit more of a legitimate prediction problem.

Multivariate linear regression, gradient descent

I’m taking Andrew Ng’s online Machine Learning course on Coursera. First time doing a MOOC for real, and on the fence about the learning style, but it is nice to have an organized class with weekly assignments.

Two weeks have gone by. The two weeks together sort of consist of one learning unit. You learn how to do linear regressions for datasets. What does that have to do with machine learning? Well, a linear regression model is a very simple form of predictive modelling for a dataset. “I fit this straight line to my 100 data points, then can use that line to predict values for arbitrary other inputs”.

The course is a bit schizophrenic about being math vs. computer programming. Ng’s lecture notes are entirely in terms of linear algebra, building up to result equations like

Screen Shot 2015-07-26 at 6.55.26 PM

(WTF? X is a matrix of your input feature set; m rows of n features each. y is an m row vector of expected outputs. Theta is an n row vector that is the coefficients of your linear regression prediction model. Alpha is the “learning rate”, a number that’s picked essentially by intuition. The assignment := is shorthand for iteration; we keep iteratively improving the theta vector until it converges.)
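The screenshot itself isn't reproduced here. Assuming it showed the standard vectorized gradient descent update for linear regression, which is what the symbols described above suggest, it would read something like this (a reconstruction, not a copy of the lecture notes):

```latex
\theta := \theta - \frac{\alpha}{m} \, X^{T} \left( X\theta - y \right)
```

For the shapes to line up, X is m by n, y has m entries, and theta has one entry per feature column of X. The term in parentheses is the vector of prediction errors, and multiplying by the transpose of X turns it into one partial derivative per coefficient.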

I hate linear algebra. Always did, ever since I was 19 years old and it was my 8AM class. It was the only math class I nearly failed, then crammed super hard the last week and got an A. Then promptly forgot it all. Happily, this class is also a programming class, and the actual exercises are “implement this function in Octave / Matlab”. So I get to turn that confusing math into simple code:

Screen Shot 2015-07-26 at 7.04.07 PM

While I’m a good programmer it’s been many years since I used a matrix programming language like Maple/Matlab/Octave/R. So getting to that function was hard-won. I ended up implementing that by following Ng’s lecture progression. He starts with a simple single variable linear regression. I coded that using lots of loops so all the actual arithmetic was scalar operations. Then I tediously hand-translated all those loops into vector forms and generalized it to multivariable inputs. Good learning exercise both to remind me how linear algebra works and to learn the funky vagaries of Octave/Matlab execution. (TIL automatic broadcasting). It was gratifying to see how much faster the code ran in vector form!

Of course the funny thing about doing gradient descent for linear regression is that there’s a closed-form analytic solution. No iterative hillclimbing required, just use the equation and you’re done. But it’s nice to teach the optimization solution first because you can then apply gradient descent to all sorts of more complex functions which don’t have analytic solutions. If I end up getting to do genetic algorithms again I’m gonna be thrilled.
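To see the two approaches agree, here is a small Python sketch for the single-variable case only (the multivariate closed form needs a matrix solve, which I am skipping for brevity). The data points, learning rate and iteration count are made up for illustration.

```python
def closed_form_fit(xs, ys):
    """Least-squares line via the textbook formulas; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope

def gradient_descent_fit(xs, ys, alpha=0.05, iterations=5000):
    """Same line found by iterative hill-climbing on the squared-error cost."""
    m = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(iterations):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = sum(errors) / m
        grad_slope = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters together, using gradients from the same pass.
        intercept -= alpha * grad_intercept
        slope -= alpha * grad_slope
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
exact = closed_form_fit(xs, ys)
iterated = gradient_descent_fit(xs, ys)
```

Both land on the same line for this toy data. The closed form gets there in one step, while gradient descent needs a learning rate and plenty of iterations, but the descent loop is the part that carries over to cost functions with no analytic solution.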

In the end I feel pretty proud of myself for completing week 2, doing all the optional extra work, and understanding it all. My long term goal here is just to understand enough about machine learning algorithms that I can stop worrying about how they are implemented, just bash about with someone else’s software libraries applied to my data. But it’s helpful to understand what’s going on under the hood.