- Standard OCR: apply standard supervised learning, which takes an image patch as input and identifies which character it is
- A multi-class classification problem
Getting lots of data: Artificial data synthesis
- We've seen over and over that one of the most reliable ways to get a high performance machine learning system is to take a low bias algorithm and train on a massive data set
- Where do we get so much data from?
- In ML we can do artificial data synthesis
- This doesn't apply to every problem
- If it applies to your problem, it can be a great way to generate loads of data
- Two main principles
- 1) Creating data from scratch
- 2) Amplifying a small labeled training set that we already have into a larger one
Character recognition as an example of data synthesis
- If we go and collect a large labeled data set, it will look like this
- The goal is to take an image patch and have the system recognize the character
- Let's treat the images as gray-scale (makes it a bit easier)

- How can we amplify this
- Modern computers often have a big font library
- If you go to websites, huge free font libraries
- For more training data, take characters from different fonts and paste these characters against random backgrounds
- After some work, can build a synthetic training set

- Random background
- Maybe some blurring/distortion filters
- Takes thought and work to make it look realistic
- If you do a sloppy job this won't help!
- So unlimited supply of training examples
- This is an example of creating new data from scratch (see the sketch below)
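A minimal sketch of the from-scratch approach using Pillow; the font directory, background patches, output size, and distortion ranges are illustrative assumptions, not part of the original pipeline.

```python
# Minimal sketch of "data from scratch": render characters from font files
# onto random background patches. Font/background paths are placeholders.
import random
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont, ImageFilter

FONTS = list(Path("fonts/").glob("*.ttf"))               # assumed local font library
BACKGROUNDS = list(Path("backgrounds/").glob("*.png"))   # assumed background patches
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def synthesize_example(size=32):
    """Return one synthetic (grayscale image, label) pair."""
    char = random.choice(ALPHABET)
    font = ImageFont.truetype(str(random.choice(FONTS)), size=size - 8)

    # Crop a random patch from a background image, converted to grayscale
    bg = Image.open(random.choice(BACKGROUNDS)).convert("L")
    x = random.randint(0, bg.width - size)
    y = random.randint(0, bg.height - size)
    patch = bg.crop((x, y, x + size, y + size))

    # Paste the character, then apply mild distortions so it looks realistic
    ImageDraw.Draw(patch).text((4, 2), char, fill=random.randint(0, 80), font=font)
    patch = patch.rotate(random.uniform(-10, 10), fillcolor=128)
    patch = patch.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 1)))
    return patch, char

# An effectively unlimited supply of labeled training examples:
synthetic_set = [synthesize_example() for _ in range(10000)]
```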
- Other way is to introduce distortion into existing data
- e.g. take a character and warp it

- 16 new examples
- Allows you to amplify the existing training set
- This, again, takes thought and insight in deciding how to amplify (see the sketch below)
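A minimal sketch of amplifying one existing labeled image by random warps, e.g. 16 distorted copies per original; SciPy is assumed available, and the distortion ranges are illustrative.

```python
# Minimal sketch of amplifying an existing example by warping it.
# The distortion ranges are illustrative; they should mimic variation
# you actually expect to see at test time.
import numpy as np
from scipy import ndimage

def warp_character(image, rng):
    """Apply a small random rotation, shift and shear to one grayscale image."""
    angle = rng.uniform(-15, 15)            # degrees
    shift = rng.uniform(-2, 2, size=2)      # pixels
    shear = rng.uniform(-0.15, 0.15)

    warped = ndimage.rotate(image, angle, reshape=False, mode="nearest")
    warped = ndimage.shift(warped, shift, mode="nearest")
    shear_matrix = np.array([[1.0, shear], [0.0, 1.0]])
    warped = ndimage.affine_transform(warped, shear_matrix, mode="nearest")
    return warped

def amplify(image, label, n_copies=16, seed=0):
    """Turn one (image, label) pair into n_copies distorted pairs."""
    rng = np.random.default_rng(seed)
    return [(warp_character(image, rng), label) for _ in range(n_copies)]
```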
Another example: speech recognition
- Learn from audio clip - what were the words
- Have a labeled training example
- Introduce audio distortions into the examples
- So only took one example
- Created lots of new ones!
- When introducing distortions, they should be representative of the kinds of noise or distortion your classifier may actually encounter (see the sketch below)
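A minimal sketch of synthesizing new speech examples by mixing background noise into one clean, labeled clip; the noise clips, SNR range, and helper names are assumptions for illustration.

```python
# Minimal sketch of audio data synthesis: mix a clean, labeled clip with
# background noise (crowd, machinery, bad connection, ...) at random levels.
import numpy as np

def add_background_noise(clean, noise, rng, snr_db_range=(5.0, 20.0)):
    """Mix `noise` into `clean` at a random signal-to-noise ratio (in dB)."""
    # Tile or trim the noise so it matches the length of the clean clip
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    snr_db = rng.uniform(*snr_db_range)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def synthesize_clips(clean, label, noise_clips, copies_per_noise=5, seed=0):
    """One labeled clip + several noise sources -> many new labeled examples."""
    rng = np.random.default_rng(seed)
    return [(add_background_noise(clean, n, rng), label)
            for n in noise_clips
            for _ in range(copies_per_noise)]
```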
Getting more data
- Before creating new data, make sure you have a low bias classifier
- If you don't have a low bias classifier, increase the number of features (or hidden units) until you do
- Then create a large artificial training set
- Very important question: How much work would it be to get 10x as much data as we currently have?
- Often the answer is, "Not that hard"
- This is often a huge way to improve an algorithm
- Good question to ask yourself or ask the team
- How many minutes/hours does it take to get a certain number of examples
- Say we have 1000 examples and want 10x as many, so we need another 9000
- At 10 seconds to manually label one example, that's 9000 × 10 = 90,000 seconds
- Comes to about 25 hours, i.e. a few days of work (see the sketch below)
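A quick back-of-the-envelope helper for the "how much work is 10x data?" question; the function name is hypothetical and the numbers match the example above.

```python
# Back-of-the-envelope estimate: how long to hand-label 10x the data?
def labeling_hours(current_examples, factor=10, seconds_per_example=10):
    extra = current_examples * (factor - 1)        # new examples needed
    return extra * seconds_per_example / 3600.0    # hours of labeling

print(labeling_hours(1000))   # 9000 examples * 10 s = 90,000 s = 25.0 hours
```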
- Crowd sourcing is also a good way to get data
- Reliability issues with the quality of the labels
- Cost
- Example: Amazon Mechanical Turk
Ceiling analysis: What part of the pipeline to work on next
- Throughout the course we've repeatedly said that one of the most valuable resources is developer time
- Pick the right thing for you and your team to work on
- Avoid spending a lot of time only to realize the work did little to enhance performance
Photo OCR pipeline
- Three modules
- Each one could have a small team on it
- Where should you allocate resources?
- Good to have a single real number as an evaluation metric
- So, character accuracy for this example
- Find that our test set has 72% accuracy
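A minimal sketch of the three-module pipeline with one end-to-end accuracy number; the module functions are placeholders standing in for the real trained components, not the course's actual implementation.

```python
# Minimal sketch of the Photo OCR pipeline as three chained modules plus a
# single real-number evaluation metric. The three stage functions are stubs.
def detect_text_regions(image):
    """Text detection: return image patches that contain text."""
    raise NotImplementedError  # placeholder for the trained detector

def segment_characters(text_patch):
    """Character segmentation: split a text patch into character patches."""
    raise NotImplementedError  # placeholder for the trained segmenter

def recognize_character(char_patch):
    """Character recognition: classify one character patch."""
    raise NotImplementedError  # placeholder for the trained classifier

def photo_ocr(image):
    return [recognize_character(c)
            for patch in detect_text_regions(image)
            for c in segment_characters(patch)]

def accuracy(test_set):
    """Fraction of test images whose transcription is exactly right."""
    correct = sum(photo_ocr(img) == list(truth) for img, truth in test_set)
    return correct / len(test_set)
```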
Ceiling analysis on our pipeline
- We go to the first module
- Mess around with the test set - manually tell the algorithm where the text is
- Simulate if your text detection system was 100% accurate
- So we're feeding the character segmentation module with 100% accurate data now
- How does this change the accuracy of the overall system

- Accuracy goes up to 89%
- Next do the same for the character segmentation
- Accuracy goes up to 90% now
- Finally do the same for character recognition
- Having done this we can see how much upside improving each module would give
- Perfect text detection improves overall accuracy by 17%!
- Improving it would bring the biggest gain
- Perfect character segmentation would improve it by 1%
- Perfect character recognition would improve it by 10%
- Might be worth working on, depends if it looks easy or not
- The "ceiling" is that each module has a ceiling by which making it perfect would improve the system overall
Other example - face recognition
- NB this is not how it's done in practice

- Probably more complicated than is used in practice
- How would you do ceiling analysis for this
- Overall system is 85%
- + Perfect background -> 85.1%
- + Perfect face detection -> 91%
- Most important module to focus on
- + Perfect eyes -> 95%
- + Perfect nose -> 96%
- + Perfect mouth -> 97%
- + Perfect logistic regression -> 100%
- Cautionary tale
- Two engineers spent 18 months improving background pre-processing
- Turns out it had no impact on overall performance
- Could have saved three years of manpower if they'd done ceiling analysis first