akozlu

Reputation: 111

Training Tesseract for Handwritten Digits: mftraining step takes forever

I have been trying to train Tesseract 3.04 to recognise handwritten digits. The method was first presented in this paper: https://arxiv.org/abs/1003.5897. I've followed the steps on the Training Tesseract 3.04 wiki page and in this tutorial: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/

I've created a single TIFF image from a scanned page of digits that I wrote by hand. I was able to create a box file and edit it using a third-party GUI for Tesseract called tesseract4java. I reached the mftraining step with no apparent issues.

But when I run the following command:

mftraining -F font_properties -U unicharset -O ali.unicharset ali.test_font.exp0.tr

The training step takes forever to run, and at some point my laptop just crashes. Since I'm training only 10 characters with at most 15 instances of each, I assume this happens because I made an error in an earlier step. Here are my ideas about what could have gone wrong:

  1. I created a font_properties file and added text to it in the required format. But since I'm also creating a new font at the same time, maybe Tesseract does not recognise the new font, or thinks I am mixing fonts in a single TIFF image. Should I add a new font name to my font_properties file? And what font are handwritten digits supposed to have anyway?

  2. The Training Tesseract page states that I should supply my training text as a UTF-8 text file, and I haven't done this step. I don't have a training text, only an image, and I didn't know how to turn the digits into a UTF-8 text file or where to put that file. Could this cause the problem I am experiencing?

  3. Maybe the files I created are in the wrong directory. Currently all the files I attached (plus unicharset and font_properties) are in the tesseract.304 directory. Should I move them to tessdata, or create a new directory within the tesseract directory?

Any help with these questions, or any other suggestions as to why my mftraining step is taking forever, would be very much appreciated. Thank you very much.

Upvotes: 2

Views: 2122

Answers (1)

akozlu

Reputation: 111

Okay, I think the problem was that I did not preprocess my input TIFF image.

After I converted the TIFF image to 8 bpp (bits per pixel) at a density of 300 dpi, the mftraining step completed in a couple of seconds. I used the following ImageMagick command:

convert -density 300 -depth 8 input.pdf output.tiff

Also, I think converting the image to grayscale helps.
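
A sketch of how the grayscale step could be folded into the same ImageMagick call. This assumes -colorspace Gray suits your scan; the function name preprocess_scan and the file names are placeholders for illustration, not part of the original workflow:

```shell
# Hypothetical helper: rasterise a scanned PDF at 300 dpi, convert it to
# grayscale, and write 8 bits per sample.
preprocess_scan() {
    # -density comes before the input so it applies when the PDF is rasterised.
    convert -density 300 "$1" -colorspace Gray -depth 8 "$2"
}

# Example call (needs ImageMagick installed):
# preprocess_scan input.pdf output.tiff
```

Running identify -verbose output.tiff afterwards should let you check the depth and resolution that were actually written.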

Edit: also, the font_properties file passed to the mftraining command should be named lang.font_properties.
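
For reference, a minimal sketch of what such a file could look like, assuming the Tesseract 3.x format of one line per font (<fontname> <italic> <bold> <fixed> <serif> <fraktur>, each flag 0 or 1). The font name has to match the middle part of the .tr filename, so for handwriting any name works as long as it is used consistently. The all-zero flags are an assumption for handwritten digits, not something confirmed above:

```shell
# Derive lang and font name from the .tr file (lang.fontname.expN.tr).
tr_file="ali.test_font.exp0.tr"
lang="${tr_file%%.*}"    # strip everything from the first dot: ali
rest="${tr_file#*.}"     # strip the leading "ali.": test_font.exp0.tr
font="${rest%%.*}"       # strip everything from the first dot: test_font

# One line per font: <fontname> <italic> <bold> <fixed> <serif> <fraktur>.
# All flags 0 is an assumption for handwriting.
printf '%s 0 0 0 0 0\n' "$font" > "$lang.font_properties"
cat "$lang.font_properties"

# The training call would then point at the renamed file (not run here):
# mftraining -F ali.font_properties -U unicharset -O ali.unicharset ali.test_font.exp0.tr
```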

Upvotes: 0
