user1428066

Reputation: 61

How to get the most accurate results with Tesseract OCR

I'm training Tesseract to recognize passport MRZ codes from a captured photo. I apply the following image pre-processing techniques before the image is sent to the Tesseract engine:

Furthermore, I've already trained the Tesseract engine on the correct font (OCR-B): I created box files from 35 or so samples (photos of text set in OCR-B), fixed any mistakes in the box files, created the training files, and finally trained the engine on all my samples to generate a traineddata file.

However, even after all this, Tesseract 3.04 in C# (engine mode = Default, pagesegmode = Auto) with my custom traineddata still makes simple mistakes such as:

Now for my question: what can I do so that Tesseract produces much more accurate results? My 30 training samples consisted of photos taken from:

  1. Passports
  2. Typed word pages with OCR-B font

Sample of what the input image looks like compared to what Tesseract receives: Image before and after pre-processing

Upvotes: 2

Views: 3703

Answers (1)

Eamonn Kenny

Reputation: 2072

Scale the image up to 480% using ImageMagick's convert program, and also apply sharpening and whitening. This gives dramatic improvements; with this pre-processing I see better results than from many commercial OCR programs.
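As a rough sketch of that pipeline, the following Python wrapper builds and runs an ImageMagick command. Only the 480% scale comes from the advice above; the sharpen radius/sigma (`0x1.0`) and the use of `-white-threshold 70%` as the "whitening" step are assumptions to tune against your own images, and the function names are purely illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, dst, scale_percent=480,
                          sharpen="0x1.0", white_threshold="70%"):
    """Return the ImageMagick argument list; nothing is executed here.

    sharpen and white_threshold are assumed starting values, not
    settings taken from the answer.
    """
    return [
        "convert",
        str(src),
        "-resize", f"{scale_percent}%",          # upscale before OCR
        "-sharpen", sharpen,                     # radius x sigma
        "-white-threshold", white_threshold,     # push light pixels to pure white
        str(dst),
    ]


def preprocess(src, dst):
    """Run the command and return the output path for the OCR step."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    subprocess.run(build_convert_command(src, dst), check=True)
    return Path(dst)
```

Calling `preprocess("mrz.png", "mrz_scaled.png")` (hypothetical file names) would write the enlarged image, which you then pass to Tesseract in place of the original. On ImageMagick 7 the `convert` entry point is considered legacy, so you may need to swap in `magick` as the first argument.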

Upvotes: 1
