TensorFlow day 2

Some more tinkering with TensorFlow, in particular the MNIST for ML Beginners and Deep MNIST for Experts tutorials. MNIST is neat; it’s a standard normalized dataset of handwriting samples for the numbers 0-9. A classic for machine vision testing, with well-known results and test accuracies of 88–99.5% depending on the approach. Consensus test data like this is so valuable in a research community. I worked with this dataset back in Ng’s Machine Learning class.

First up, MNIST for ML Beginners. It has you build a basic softmax regression model to classify the numbers, then train it. Final accuracy is about 92%.

I followed this just fine; it’s pretty straightforward and not too different from the “getting started” tutorial, just on real data (MNIST) and with some slightly more sophisticated functions like softmax and cross_entropy. Some notes:

  • TensorFlow has sample datasets built in, under the tensorflow.examples package.
  • The MNIST data set has a “.train” collection of training data and a (presumably disjoint) “.test” collection for final test data. The .train set also has a method .next_batch() which lets you randomly subsample rather than training on all data every single iteration.
  • The concept of “one-hot” representation. For labeling the digits 0-9 we have an array of 10 numbers (one per digit). Every number is 0 except for a single 1, which marks the label. There’s also the “tf.argmax()” function for quickly finding the index of the largest entry, i.e. the column set to 1.
  • The softmax function, which takes a vector of scores and normalizes it into a vector of probabilities that sum to 1. The weighting is exponential, so the largest score gets most of the probability mass.
  • TensorFlow has an InteractiveSession which lets you mix declaring stuff with running session code conveniently. Good for noodling in a notebook.
  • “Loss functions”, basically a measure of the error between a prediction your model makes and the expected result. These tutorials use cross_entropy, an information-theory measure that compares the predicted probability distribution to the true labels rather than just measuring raw error.
  • tf.train.GradientDescentOptimizer() is a simple optimizer we apply here in a straightforward way. Note this is where TensorFlow’s automatic differentiation comes into play, to do the gradient descent. (A condensed sketch of the whole pipeline follows this list.)

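For my own reference, the whole beginner pipeline boils down to roughly this. It’s paraphrased from memory against the TF 1.x API, so treat it as a sketch rather than the tutorial’s exact code:

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# one_hot=True gives one-hot labels: [0,0,0,1,0,0,0,0,0,0] means "3"
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)        # softmax turns scores into probabilities

# cross-entropy loss between the predicted probabilities and the true labels
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)   # random minibatch of 100
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# argmax picks the index of the largest entry: the predicted digit / the labeled digit
correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))  # ~0.92
```
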
The second tutorial I did was Deep MNIST for Experts. This has you building a 4-layer neural network (aka “deep”): two convolutional layers that map 5×5 patches of the image to 32, then 64 features, then a fully connected layer that flattens everything into 1024 features before classifying it. Final accuracy is about 99.2%.

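For reference, the model construction boils down to roughly this (TF 1.x style, paraphrased; the helper names and exact hyperparameters are my own shorthand, not a verbatim copy of the tutorial):

```python
import tensorflow as tf

def weight_variable(shape):
    # small random initial weights break the symmetry between neurons
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    # slightly positive bias to avoid "dead" ReLU neurons
    return tf.Variable(tf.constant(0.1, shape=shape))

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
x_image = tf.reshape(x, [-1, 28, 28, 1])   # rank-4: [batch, height, width, channels]

# conv layer 1: 5x5 patches -> 32 features, then 2x2 max pool down to 14x14
W1, b1 = weight_variable([5, 5, 1, 32]), bias_variable([32])
h1 = tf.nn.relu(tf.nn.conv2d(x_image, W1, strides=[1, 1, 1, 1], padding='SAME') + b1)
p1 = tf.nn.max_pool(h1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# conv layer 2: 5x5 patches -> 64 features, pooled down to 7x7
W2, b2 = weight_variable([5, 5, 32, 64]), bias_variable([64])
h2 = tf.nn.relu(tf.nn.conv2d(p1, W2, strides=[1, 1, 1, 1], padding='SAME') + b2)
p2 = tf.nn.max_pool(h2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# fully connected layer: flatten 7x7x64 -> 1024 features, with dropout
W3, b3 = weight_variable([7 * 7 * 64, 1024]), bias_variable([1024])
flat = tf.reshape(p2, [-1, 7 * 7 * 64])
h3 = tf.nn.relu(tf.matmul(flat, W3) + b3)
keep_prob = tf.placeholder(tf.float32)
h3_drop = tf.nn.dropout(h3, keep_prob)

# readout layer: 1024 features -> 10 classes
W4, b4 = weight_variable([1024, 10]), bias_variable([10])
y_logits = tf.matmul(h3_drop, W4) + b4

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_logits))
# swapping in Adam instead of plain gradient descent really is a one-line change
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
```
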
I had a harder time following this; it assumes a lot more machine learning knowledge than the previous tutorials. If you don’t know things like what a rectified linear unit (ReLU) is, what dropout does, or what the Adam optimizer is, you’re gonna be a bit lost. That’s me; I’m kind of blindly copying stuff in as I go.

  • The full source has this weird thing about name_scope in the code. As far as I can tell it’s mostly about grouping ops under readable names for the graph visualization (TensorBoard), not anything functional. I left it out and my code seems to have worked.
  • This code gets a bit complicated because you’re working with rank-4 tensors, i.e. one giant 4-dimensional array. The first dimension is the image index in the batch, the second and third are pixels (in a 28×28 square), and the fourth is a single channel for the grey value. It’s a standard setup for 2D image processing, I imagine.
  • The network structure is neat. Intuitively you boil 28×28 grey pixel values down into 14×14 32-dimensional values. Then you boil that down again to 7×7 64-dimensional values, and finally to a single 1024-feature vector. I’m fascinated to know more about these intermediate representations. What are those 1024 features? I expect one is “looks like a vertical line” and one is “looks like a circle at the top” and the like, but who knows. (I bet someone does.)
  • The pooling from 28×28 → 14×14 → 7×7 is odd to me. It uses max_pool, which just takes the maximum value from each 2×2 window. Surprised that blunt an instrument doesn’t throw things off. For that matter, what does a derivative of this function mean? Apparently the gradient just flows back to whichever input produced the max. (There’s a toy example of max_pool and dropout after this list.)
  • Dropout sounds crazy; you randomly drop nodes from the neural network during training. This keeps the network honest and avoids overfitting. It feels a bit like randomly harassing someone while they’re studying to keep them on their toes. The paper they linked presents dropout as an alternative to other regularization techniques. I note this code never does any other regularization (no weight decay), so I guess it works?
  • They also introduce the idea of initial weights in a neural network. I remember this from Ng’s course; you want them to not all be 0, because then nothing can break the symmetry. Also they give everything a slightly positive bias term to avoid “dead neurons”; apparently a ReLU that only ever outputs 0 gets no gradient and never recovers.
  • The pluggable nature of TensorFlow’s modules is apparent here, particularly the swap to the “Adam Optimizer” over simple gradient descent. I have no idea what this algorithm does, but using it is literally a one-line code change. And presumably it’s better, or so the linked paper claims.
  • It’s slow! 20,000 training iterations on an i7-2600K is taking ~20 minutes. Now I wish I had the custom-compiled AVX build, or a GPU hooked up :-) At least it’s running as many threads as it should (7 or 8).
  • They have you running 20,000 training iterations but the accuracy measured against the training set converges to 0.99 by around 4000 iterations. I wonder how much the network is really changing at that point. There’s a lot of random jitter in the system with the dropouts and sampling, so there’s room. The accuracy against the test set keeps improving up to about 14,000 steps.

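To convince myself about max_pool and dropout, a quick toy experiment of my own (not part of the tutorial):

```python
import numpy as np
import tensorflow as tf

sess = tf.InteractiveSession()

# a fake 4x4 one-channel "image", shaped [batch, height, width, channels]
img = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)
pooled = tf.nn.max_pool(img, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
print(pooled.eval().reshape(2, 2))
# [[ 5.  7.]
#  [13. 15.]]   <- just the max of each 2x2 window

# dropout zeroes out a random ~half of the values and scales survivors by 1/keep_prob
v = tf.ones([10])
print(tf.nn.dropout(v, keep_prob=0.5).eval())
# something like [2. 0. 2. 2. 0. 0. 2. 0. 2. 2.], different every run
```
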
One thing these tutorials are missing is more visual feedback as you go along. That, and some easy way to actually use the model you’ve spent an hour building and training.

I’d like to go back and implement the actual neural network I built for MNIST in Ng’s class. IIRC it’s just one hidden layer: the 20×20 pixels are treated as a linear array of 400 numbers, then squashed via sigmoid functions to a hidden layer of 25 features, then squashed again to a one-hot layer of 10 numbers. It would be a good exercise to redo this in TensorFlow. The course notes describe the network in detail and suggest you should expect about 95.3% accuracy after training.
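
A rough sketch of what that might look like. The 400 → 25 → 10 shapes are from my memory of the course, the data loading is left out entirely, and I’d be swapping the course’s sigmoid-output logistic cost for TensorFlow’s softmax cross-entropy, so it’s not a faithful port:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 400])   # 20x20 pixels, flattened
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot digit labels

# small random weights to break symmetry, as discussed above
W1 = tf.Variable(tf.truncated_normal([400, 25], stddev=0.1))
b1 = tf.Variable(tf.zeros([25]))
hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1)  # squash down to 25 hidden features

W2 = tf.Variable(tf.truncated_normal([25, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
logits = tf.matmul(hidden, W2) + b2            # 10 output scores, one per digit

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
```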