How to get the most accurate results with Tesseract OCR

Question

I'm in the process of building/training Tesseract to recognize passport MRZ codes from a captured photo. I apply the following image pre-processing techniques before the image is sent to the Tesseract engine:

Binarization
Normalization
Sampling
Denoising
Thinning (optionally)
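To make the first step in the list above concrete, here is a minimal sketch of binarization using Otsu's method, written in plain Python so it has no dependencies. It is an illustration only, not the pipeline actually used in this question: the function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a list of rows of 8-bit gray levels.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for a flat list of 8-bit gray levels (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Map each pixel to 0 (at or below the threshold) or 255 (above it)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny hypothetical image: dark "ink" values near 10, bright paper near 200.
    sample = [[10, 12, 200],
              [11, 210, 205]]
    print(binarize(sample))
```

A global threshold like this assumes fairly even lighting; photos with shadows or glare may need a locally adaptive threshold instead.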

Furthermore, I've already trained the Tesseract engine on the correct font (OCR-B): I created box files from 35 or so samples (photos of text printed in OCR-B), fixed any mistakes in the box files, created the training files, and finally trained the engine on all my samples to generate a traineddata file.

However, even after all this, Tesseract 3.04 in C# (engine mode = Default, pagesegmode = Auto) with my custom traineddata still makes simple mistakes such as:

Confusing alphabetic characters with numeric ones (or vice versa), for example S and 5, or B and 8.
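One way to at least detect this kind of confusion is to use the structure of the MRZ itself. The sketch below assumes the check-digit scheme defined in ICAO Doc 9303 (weights 7, 3, 1 repeating; digits count as themselves, A-Z as 10-35, `<` as 0; result modulo 10). The helper `fix_numeric_field` and its lookalike table are hypothetical and cover only the two confusions mentioned above, and the date `880512` is made up for illustration.

```python
def mrz_char_value(ch):
    """Numeric value of one MRZ character under the ICAO 9303 scheme."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Check digit for an MRZ field: weights 7, 3, 1 repeating, modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


# Hypothetical lookalike table, limited to the confusions named in the question.
DIGIT_LOOKALIKES = {"S": "5", "B": "8"}


def fix_numeric_field(text):
    """Replace letter lookalikes in a field that is known to be digits only."""
    return "".join(DIGIT_LOOKALIKES.get(c, c) for c in text)


if __name__ == "__main__":
    ocr_date = "B80S12"    # hypothetical OCR output for the date 880512
    printed_check = 0      # check digit that would be printed after 880512
    print(mrz_check_digit(ocr_date) == printed_check)       # mismatch flags an error
    repaired = fix_numeric_field(ocr_date)
    print(repaired, mrz_check_digit(repaired) == printed_check)
```

This only applies to fields whose character class is fixed (dates, for instance) and that carry a check digit; it does not improve the recognition itself.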

Now for my question: what can I do so that Tesseract produces much more accurate results? My 30 training samples consisted of photos taken from:

Passports
Typed word pages with OCR-B font

Sample of what the input image would look like compared to what Tesseract receives:

