Reputation: 6372
I have trained a Tesseract 4 LSTM model against a set of ~30,000 ground truth images that I generated (as opposed to using "real" images from scanned works, of which I do not have enough to reliably train a model).
The model works well (or at least better than eng, on which it is based). The image generation script has several parameters that I can adjust, but I'd like to do that in a more "ordered" way than just eyeballing the output, so I'd like to generate metrics based on accuracy across a (much smaller) set of real-world images.
However, it is not clear to me how to take a set of line images and ground-truth text files and generate the required files to run lstmeval on the new model. How do you generate the data to feed to lstmeval when the evaluation images are not related to the images actually used to train the model in the first place?
Upvotes: 0
Views: 642
Reputation: 6372
You can generate the .lstmf files needed for the evaluation like this, assuming the evaluation ground truth is in tesstrain/data/eval-ground-truth:
cd tesstrain
make lists MODEL_NAME=eval
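For this to work, the evaluation data should follow the same pairing convention tesstrain uses for training data: one image per text line, plus a transcription with the same base name and a .gt.txt extension. The file names below are only an illustration:

tesstrain/data/eval-ground-truth/
    line_0001.png       # a single text line (tesstrain also accepts .tif)
    line_0001.gt.txt    # ground-truth transcription of that line
    line_0002.png
    line_0002.gt.txt
    ...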
Running make lists generates a file data/eval/all-lstmf, which contains a list of all the .lstmf files generated. The list.eval file contains only a subset, because the ground-truth corpus is partitioned into evaluation and training sets (according to RATIO_TRAIN).
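Using all-lstmf sidesteps that split, which is what you want when the whole directory is evaluation data. Alternatively, assuming RATIO_TRAIN behaves as in the tesstrain Makefile (the first RATIO_TRAIN fraction of all-lstmf goes to list.train, the remainder to list.eval), you could force every line into list.eval:

cd tesstrain
make lists MODEL_NAME=eval RATIO_TRAIN=0
# data/eval/list.eval should now contain the same entries as all-lstmf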
You can then run lstmeval:
lstmeval \
--model data/your_model.traineddata \
--eval_listfile data/eval/all-lstmf
Running this produces something like the following (the mistake below was added to the ground truth of one .gt.txt file to provoke an error for demonstration purposes):
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:TThoſe hypocrites that live amongſt us,
OCR :Those hypocrites that live amongst us,
At iteration 0, stage 0, Eval Char error rate=1.282051, Word error rate=8.333333
If there are no errors (as would be the case here without the introduced mistake), the output looks like this:
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=0.000000, Word error rate=0.000000
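Since the aim here is to compare image-generation parameters, it may help to script the evaluation over several candidate models and collect the error rates. A minimal sketch; the candidate_*.traineddata names are hypothetical, and the redirect assumes lstmeval prints its summary on stderr:

for model in data/candidate_*.traineddata; do
    echo "== $model =="
    lstmeval \
        --model "$model" \
        --eval_listfile data/eval/all-lstmf 2>&1 \
        | grep "error rate"
done

Ranking the candidates by character and word error rate on the same real-world evaluation set then gives you the ordered comparison you are after.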
Upvotes: 1