Machine learning: backpropagation shitshow

Frustrating week in the machine learning course. We learned how to implement backpropagation in neural networks. Or at least we pretended to.

Lectures

Ng’s lectures skate over the math pretty fast, and there are like three separate asides reassuring us “if you don’t understand this don’t worry, it’s hard and I don’t always remember it either lol!”. I think this course tried to walk the line between really doing the math and skipping it entirely, and fell into the gap in the middle. I’d have preferred to skip the math entirely. I kind of did, honestly, and fell back on my “how do I fake this” survival skills from the few college math classes I didn’t like.

The useful part is Ng also emphasizes the intuitions and mechanical logic of what backpropagation is doing, and that I think I understand. Forward propagation pushes the inputs through a mesh of hidden layers, each with its own weights, to produce the outputs. Backprop starts at the output layer and observes the error: how far the network’s output is from the expected output. It then apportions that error to the previous layer’s nodes by spreading the error backwards, weighted by the connection weights. And so the error is propagated backwards through each layer, and in the end you have some idea of how much each weight contributed to the mistakes the output made.
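Here’s roughly what that mechanical logic looks like for a single training example. This is my own numpy sketch rather than the course’s Octave, for a tiny one-hidden-layer network with sigmoid activations; the variable names follow the lecture notation (a1, z2, delta3, and so on) but everything else is just for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_one_example(x, y, Theta1, Theta2):
    """Forward pass, then push the output error back one layer.

    x: input vector (n,), y: one-hot target vector (k,)
    Theta1: (hidden, n+1) and Theta2: (k, hidden+1) weight matrices.
    """
    # Forward propagation: input -> hidden -> output
    a1 = np.append(1.0, x)             # add the bias unit
    z2 = Theta1 @ a1
    a2 = np.append(1.0, sigmoid(z2))   # hidden activations, plus bias
    z3 = Theta2 @ a2
    a3 = sigmoid(z3)                   # the network's output

    # Backward pass: start with the error at the output layer...
    delta3 = a3 - y
    # ...then apportion it to the hidden layer, weighted by Theta2
    # and scaled by the sigmoid gradient (drop the bias column).
    delta2 = (Theta2[:, 1:].T @ delta3) * sigmoid(z2) * (1 - sigmoid(z2))

    # This example's contribution to the partial derivatives of J
    grad1 = np.outer(delta2, a1)
    grad2 = np.outer(delta3, a2)
    return grad1, grad2
```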

More concretely, backpropagation gives you a cost function J and its partial derivatives, where each weight in the neural network gets its fair share of blame for the error. With the cost function and gradients in hand you can run a function optimizer to find the weights that minimize the cost. (One fact Ng skated over: this cost function is no longer convex, which means the optimizer might get stuck in local minima. He mentions the problem and says not to worry about it in practice. Um, ok.)
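In practice that means you hand the optimizer one function that returns both the cost and its gradient. A hedged sketch, using scipy’s minimize as a stand-in for the optimizer the homework supplies; nn_cost_function here is a hypothetical function returning (J, gradient) for a flattened vector of all the weights.

```python
from scipy.optimize import minimize

# nn_cost_function is assumed to return (J, grad) for one flattened
# vector of all the Theta weights, the shape backprop hands you.
def train(nn_cost_function, initial_thetas, X, y, lam):
    result = minimize(
        nn_cost_function,
        initial_thetas,            # random starting weights, flattened
        args=(X, y, lam),
        jac=True,                  # the cost function also returns the gradient
        method="L-BFGS-B",
        options={"maxiter": 400},
    )
    # Trained weights: a local minimum, not necessarily the global one.
    return result.x
```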

One fun thing in optimizing neural networks: if you start one out with every node having the same constant weights (say, all zero), then all the nodes look identical and the error is apportioned equally, so the nodes never train into anything different from each other. To break this degenerate symmetry you start training the network with randomized weights. Reminds me of the Big Bang and the lumpy cosmic microwave background.
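The fix is only a couple of lines. A sketch, where the small epsilon range is the kind of value the course suggests rather than anything I derived myself:

```python
import numpy as np

def random_init(fan_out, fan_in, epsilon_init=0.12):
    """Initialize a weight matrix with small random values.

    All-zero (or any constant) weights would make every hidden node
    compute the same thing and receive the same share of the error,
    so they'd never differentiate. Random values break that symmetry.
    """
    return np.random.uniform(-epsilon_init, epsilon_init,
                             size=(fan_out, fan_in + 1))  # +1 for the bias column
```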

Homework

The homework programming exercises are the only way I feel like I’m really learning anything. They’re reasonably good.

However, this week’s lecture + assignment clearly say “don’t try to vectorize over the 5000 training inputs, just use a loop”. But then the tutorial that’s my real lifeline for implementing these assignments is a vectorized implementation! Which is it, teacher? I’ve mostly preferred implementing the vectorized versions, both because they’re faster to run and because they’re a closer match to the math equations from the lectures. It’s sure nice avoiding writing lots of for loops. But the teaching materials really need to decide to either vectorize everything or not; this kind of conflict is a mess that I imagine loses a lot of students.

While I’ve been doing vectorized implementations, really I’d prefer not to vectorize over the training inputs and instead write all my code to process one training example at a time. That’s the way a lot of stuff works in the real world anyway.
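For what it’s worth, here’s the difference in miniature: forward propagation done both ways, in a numpy sketch of my own rather than the assignment’s Octave. The loop version handles one example at a time; the vectorized version treats all m examples as rows of one matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_loop(X, Theta1, Theta2):
    """One training example at a time, the way the assignment suggests."""
    outputs = []
    for x in X:                                   # X is (m, n), one row per example
        a1 = np.append(1.0, x)                    # add bias unit
        a2 = np.append(1.0, sigmoid(Theta1 @ a1))
        outputs.append(sigmoid(Theta2 @ a2))
    return np.array(outputs)

def forward_vectorized(X, Theta1, Theta2):
    """All m examples at once, the way the tutorial does it."""
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])                        # bias column
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    return sigmoid(A2 @ Theta2.T)
```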

Anyway I stumbled through the notes and managed to implement backpropagation. I think I understand how it works too, although I just accept the provided cost function and derivatives as gospel truth rather than having done the derivation myself. And in the end I’ve trained my own neural network to recognize handwritten digits. Yay!

One fun thing in the homework is they have you visualize the hidden layer. That layer has 25 nodes. The image below shows, for each of those nodes, how much weight it puts on any particular input pixel. If you squint you can convince yourself this hidden layer is doing some feature detection, like the bright vertical line in row 2, column 3. Also clearly the corners of the image aren’t useful for anything.

[Image: hidden layer weight visualization (Screen Shot 2015-08-16 at 5.59.57 PM)]
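The visualization itself isn’t much code. Roughly, each hidden node’s row of Theta1 (minus its bias weight) reshapes back into an image the size of the input digit. A matplotlib sketch, assuming the assignment’s 20x20 pixel inputs; the reshape orientation is a guess:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_hidden_layer(Theta1, img_side=20):
    """Show each hidden node's input weights as a little image.

    Theta1 is assumed to be (25, 401): 25 hidden nodes, each with a
    bias weight plus one weight per input pixel (20x20 = 400 here).
    """
    fig, axes = plt.subplots(5, 5, figsize=(6, 6))
    for node_weights, ax in zip(Theta1[:, 1:], axes.flat):   # drop bias column
        ax.imshow(node_weights.reshape(img_side, img_side).T, cmap="gray")
        ax.axis("off")
    plt.show()
```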

I thought the technique of gradient checking was interesting. Backpropagation (and other cost functions) requires you to analytically divine the partial derivatives of your cost function, or in my case copy them blindly from the lecture notes, and then implement them in code. It’s pretty easy to have a bug. And apparently subtle errors in the gradient function result in neural networks that sort of converge, just not very well, so you might not even notice the bug. Ng suggests as a backup check that you also numerically approximate the derivatives, by evaluating the cost function an epsilon away from each parameter value and calculating the gradient from that. Then check that this numeric approximation is close to your analytically derived implementation. I like the idea; it’s a sort of interactive debugging technique. It also raises the question of why bother with the analytic derivative at all, why not just use the numerical approximation? Because it’s expensive to calculate the numeric approximation, that’s why. And also it’s less accurate.
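The check itself is just a loop over the parameters, nudging each one by epsilon on either side and comparing the resulting change in cost to what backprop computed. A sketch, where cost_fn stands in for any function returning the scalar cost J for a flattened weight vector:

```python
import numpy as np

def numerical_gradient(cost_fn, theta, epsilon=1e-4):
    """Approximate dJ/dtheta_i by evaluating the cost epsilon away
    on each side of each parameter (a two-sided difference)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bump = np.zeros_like(theta)
        bump[i] = epsilon
        grad[i] = (cost_fn(theta + bump) - cost_fn(theta - bump)) / (2 * epsilon)
    return grad

# Usage idea: compare against the backprop gradient; they should agree
# to several decimal places. This is also why you don't train this way:
# it costs two full cost evaluations per parameter, every iteration.
# np.allclose(numerical_gradient(cost_fn, theta), backprop_gradient, atol=1e-7)
```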

I also appreciate the role regularization plays. Apparently it’s not hard to overfit a neural network to the training data: i.e. it’s 100% accurate on the data it trained on, but it’s overfit and may not work well on new data. The lambda regularization parameter adds a penalty to the cost (and a corresponding term to the gradients) that encourages all the weights to stay small, which apparently is the way you keep a neural network honest.
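Mechanically the regularization term is a small addition to both the cost and the gradients. Here’s a sketch of just that piece, leaving the bias weights unpenalized the way the course does:

```python
import numpy as np

def add_regularization(J, Theta1_grad, Theta2_grad, Theta1, Theta2, lam, m):
    """Penalize large weights: add lambda/(2m) * sum of squared weights
    to the cost, and lambda/m * weight to each gradient entry.
    The bias weights (first column of each Theta) are left alone."""
    J += (lam / (2 * m)) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    Theta1_grad[:, 1:] += (lam / m) * Theta1[:, 1:]
    Theta2_grad[:, 1:] += (lam / m) * Theta2[:, 1:]
    return J, Theta1_grad, Theta2_grad
```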

Here’s a picture of the hidden layer’s weights for an overfit model. Notice how there’s no obvious feature detection? At least, there’s no obvious vertical stroke detector like there was in the previous model. This model’s accuracy on its training data is 100%, but it may be less useful when applied to data it wasn’t trained on.

[Image: hidden layer weights for the overfit model (Screen Shot 2015-08-16 at 6.01.36 PM)]

That’s the kind of machine learning trap I’m taking this course to learn more about. I sort of understand the math behind why lambda works, but I can see it would take a lot of experience applying this stuff to real problems to really get a feel for how to set it right.

In the end I really hope to never, ever implement a neural network again in my life. I do honestly just want to use someone else’s neural network library. I do feel some sense of accomplishment having built my own though. And my real goal here is to understand a little of what’s going on inside the black box so I can use it better. I gather neural networks have a lot of subtle traps.

I still have to go and apply this stuff to real data I really care about. I started trying to do that a couple of weeks ago, in Python, but got frustrated trying to learn Pandas and scikit-learn and IPython notebooks all at once.