TensorFlow optimized builds

tl;dr: install these TensorFlow binaries for a 2-3x speedup.

Update: or not; turns out the AVX binaries are probably only 10% faster. See below.

I’m now running TensorFlow programs slow enough that I care about optimization. There are several options here for optimized binaries:

  1. Stock TensorFlow
  2. TensorFlow recompiled to use Intel CPU parallel instructions like SSE and AVX. See also the warning stock TensorFlow gives:
    tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
  3. TensorFlow with the GPU
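Before hunting for an optimized build, it helps to know which of those instruction sets the CPU really has. Here is a minimal sketch, assuming a Linux x86 machine where /proc/cpuinfo has a "flags" line; the function names are made up for this example and the list of wanted features is just the ones that matter in this post.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text.

    Assumes the x86 Linux layout, where one line looks like
    "flags : fpu vme ... sse4_1 sse4_2 avx".
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, wanted=("sse4_1", "sse4_2", "avx", "avx2")):
    """List the wanted features that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as f:
        return f.read()


# Typical use on a Linux box:
#   print(missing_features(read_cpuinfo()))
```

An empty list means a binary built with any of those features should run. If "avx2" shows up as missing, an AVX2 build is not going to work on that machine.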

I’m trying to get from 1 to 2; from what I’ve read it’s a 2-3x speedup. GPU is even better of course but is a lot more complicated to set up. And the Linux box I do my work on doesn’t even have a GPU (although my Windows desktop does).

I’m testing all this with a simple neural network with one hidden sigmoid layer and the Adam optimizer, training to recognize MNIST data.

I tried building TensorFlow from source and quit pretty quickly. It requires bazel to build, which in turn requires a Java runtime, and I noped out. Probably could get it working with a couple of hours’ time.

I tried Intel’s optimized TensorFlow binaries. These seem not to be built with AVX; I still get the warning. They are also slower: my little program took 210s to run instead of 120s. Reading their blog post, it sounds like this is mostly Intel’s crack optimization team reordering code so it runs more efficiently on their CPUs. (Intel has an amazing group of people who do this.) Also the patches were submitted back to Google and are probably in stock TensorFlow. Not sure why it’s slower, and I’m bummed they didn’t build with AVX, but here we are.

lakshayg’s binaries. No idea who this guy is but sure, I’ll try a random binary from anyone! Bingo! My program goes from 120s to 46s, or a 2.6x speedup. Hooray! (But see below.) One slight caveat: this is 1.4.0rc1, not the latest 1.4.1. There’s about two weeks’ worth of bug fixes missing.

TinyMind’s TensorFlow wheels are another source of precompiled Linux versions of TensorFlow. They’re built with AVX2, which unfortunately my processor doesn’t support.

Starting with 1.6, Google is going to release AVX binaries only. This breaks older CPUs; it’s a shame they can’t release several different binaries.

Update: I’ve noticed the performance isn’t stable. With the AVX binaries my program runs sometimes in 46 seconds (yay!) and sometimes in 110 seconds (boo!). With Google’s stock build it’s sometimes 51 seconds and sometimes 120. That suggests the AVX binaries aren’t a significant speedup for my program and I have a deeper mystery.
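With run times this variable, a single measurement per binary is misleading. A minimal sketch of timing the same workload several times and looking at the spread; time_runs and summarize are hypothetical helper names, and fn stands in for whatever runs one full training job.

```python
import statistics
import time


def time_runs(fn, repeats=5):
    """Call fn() `repeats` times and return each call's wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of timings to min, median, and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Typical use, where train() is your own training function:
#   print(summarize(time_runs(train, repeats=5)))
```

Comparing the minimum and the median across builds is usually more telling than comparing two single runs. A big gap between min and max on the same binary points at something other than the compiled instruction set.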

I spent several hours figuring this out. Turns out in the slow case, my program spends most of its time in mnist.next_batch(), I think when it runs out of data and has to reshuffle. I have no idea why it’s so variable or slow but it’s not an interesting failure given this is tutorial code. Does remind me I should learn more about how to manage test data correctly in TensorFlow.
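One way to find a hotspot like this is Python’s built-in cProfile, which reports how much time is spent under each function. This is only a sketch of that approach, not a record of how the problem above was tracked down; profile_top is a hypothetical helper name and fn stands in for the training function.

```python
import cProfile
import io
import pstats


def profile_top(fn, limit=10):
    """Run fn() under cProfile and return a text report of the top entries,
    sorted by cumulative time (time in a function plus everything it calls)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn()
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


# Typical use, where train() is your own training function:
#   print(profile_top(train))
```

If data loading is the problem, a call like next_batch should show up near the top of the report with a large cumulative time.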

If I stub out the batching so it’s not a factor my program runs in about 29s with the AVX binaries, 32s with stock binaries (no AVX). So maybe a 10% improvement. That’s not very exciting.
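There are several ways to take the batching out of the measurement. One is to replace the tutorial’s batcher with something trivial and predictable. Below is a minimal sketch of that idea, assuming the training data fits in memory; batch_indices is a hypothetical helper, not part of TensorFlow. It yields index lists, so the caller picks the rows out of its own arrays.

```python
import random


def batch_indices(num_examples, batch_size, epochs, seed=0):
    """Yield lists of example indices, reshuffling once per epoch.

    The last partial batch of each epoch is dropped so that every
    batch has exactly batch_size entries.
    """
    rng = random.Random(seed)
    order = list(range(num_examples))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, num_examples - batch_size + 1, batch_size):
            yield order[start:start + batch_size]


# Typical use with in-memory arrays images and labels (NumPy arrays
# accept a list of indices directly):
#   for idx in batch_indices(len(images), 100, epochs=10):
#       batch_x, batch_y = images[idx], labels[idx]
```

To remove shuffling cost entirely from a benchmark, the batches can be materialized with list(...) before the timer starts.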
