Machine Learning: Photo OCR, real world problem

These last two weeks of the course were so short I finished them all in one day.

I liked this last week, week 11. Most of the lectures were about a real world machine learning problem: finding and OCRing text in photographs.

Photo OCR

The key idea here is the pipeline. He broke the problem up into 4 separate machine learning problems that can be worked on independently.

  1. Find regions of the image that look like text.
  2. Segment text regions into individual characters.
  3. OCR individual characters
  4. Spell correct OCR errors

The real meat of the lesson was the sliding windows algorithm, the technique by which you identify text regions. Basically you look at small square regions of the image and train a recognizer to identify text/not-text. Then you merge adjacent regions of text into single blobs for handing off to step 2, the segmenter. The segmenter also uses a sliding windows analysis to find whitespace between characters, albeit a 1d window.

Lessons for machine learning

Second half of the lesson was reflections on doing better for machine learning.

First question: do you need more data? It may be expensive to get it. More data is most useful for low bias machine learning algorithms, so don’t bother if you have an overfit model. (Or rather, loosen your model first!). He also talked about artificial data synthesis, ways to generate more data from existing labelled training sets. For instance if you’re doing voice recognition, maybe you can generate more training examples by adding real-world distortion to existing examples. However don’t just add random noise, since that doesn’t really train anything useful.

Second question: what should I work on next? The key idea here is ceiling analysis. Basically you go down your pipeline replacing each step with a perfect system. Ie, replace step 1 with a perfect text region classifier; how much does your system improve? Now also replace step 2 with a perfect segementer: how much better does your system do? With that experiment in hand you can identify which steps in the pipeline have the most room for improvement, are worth your time to work on.

Once again I wish there were a programming exercise here to hammer these lessons home. Have us construct our own pipeline and do some testing / segmentation on it. I think this would be a good time to revisit the ideas of training vs. test sets, applying the test set well. Also evaluating systems for bias/variance problems. I suspect the Stanford undergrad class had students doing some final project which was just left out of the Coursera class, perhaps because it would be impossible to grade. Guess it’s up to me to apply this stuff to real problems and make my own mistakes without a teacher to tell me how I’m doing. Unsupervised learning, as it were.

And that’s the course! I’ve got one more blog post coming summarizing what all we covered and what I thought. Overall I’m quite positive, it was a good use of my time.