Reputation: 5681
What does the box file need to look like if I use a multipage TIFF to train Tesseract?
More precisely: how do the Y-coordinates in a box file correspond to Y-coordinates within the individual pages?
Upvotes: 0
Views: 1100
Reputation: 8345
The sixth and last column in the box file is the zero-based page number.
https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files
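To make the six-column layout concrete, here is a minimal Python sketch that parses box lines and groups them by page. It assumes the usual layout `<symbol> <left> <bottom> <right> <top> <page>` and, as far as I understand the format, that the coordinates are relative to each page with the origin at the bottom-left corner of that page. The names `Box`, `parse_box_line`, `group_by_page` and `to_top_left_origin` are made up for illustration and are not part of Tesseract, and the sample lines are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int  # zero-based page number (sixth column)


def parse_box_line(line: str) -> Box:
    # Split from the right so that only the last five fields are treated as numbers.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def group_by_page(lines):
    # Collect the boxes of each page under its zero-based page number.
    pages = defaultdict(list)
    for line in lines:
        if line.strip():
            box = parse_box_line(line)
            pages[box.page].append(box)
    return dict(pages)


def to_top_left_origin(box: Box, page_height: int):
    # Assumption: box files use a bottom-left origin per page.
    # Convert to the top-left origin used by most image libraries.
    return (box.left, page_height - box.top, box.right, page_height - box.bottom)


# Invented sample: the same glyph position on page 0 and on page 1.
SAMPLE = [
    "T 10 20 30 60 0",
    "e 35 20 50 45 0",
    "T 10 20 30 60 1",
]
```

Under that assumption, the Y-coordinates do not accumulate across pages: the `T` on page 1 has the same coordinates as the `T` on page 0, and only the last column differs.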
Update:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
Each font should be put in a single multi-page tiff and the box file can be modified to specify the page number for each character after the coordinates. Thus an arbitrarily large amount of training data may be created for any given font, allowing training for large character-set languages.
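One way to "modify the box file to specify the page number" is to rewrite the last column when combining box data made for single pages. The sketch below is only an illustration under that assumption: `merge_page_boxes` is a hypothetical helper, not a Tesseract tool, and it expects every input line to already have six space-separated columns.

```python
def merge_page_boxes(per_page_lines):
    """Combine per-page box lines into one list, renumbering the page column.

    per_page_lines: a list with one entry per page, in the same order as the
    pages of the multipage TIFF; each entry is a list of box-file lines.
    """
    merged = []
    for page_index, lines in enumerate(per_page_lines):
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Keep the symbol and coordinates, replace only the page number.
            symbol, left, bottom, right, top, _old_page = line.rsplit(" ", 5)
            merged.append(f"{symbol} {left} {bottom} {right} {top} {page_index}")
    return merged
```

The coordinates are left untouched, on the assumption that each page keeps its own coordinate system.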
Although the training text can be as large as you want, a very large text can result in an unnecessarily large image and hence slow down training.
Upvotes: 1