Charles Kelly

Reputation: 17

Tesseract 4: Why isn't my training data compiling?

I am trying to train Tesseract 4 to recognise electronic circuit diagram symbols (resistors, capacitors, etc.) from images, but there seems to be no straightforward guide to training Tesseract, and the official documentation focuses on fonts rather than image data.

The reply on this post seems to be the most helpful thing I've found so far but when following the steps I get an error:

What I've done so far:

Note: I know I need far more data than this; this is just a test to get everything working and successfully produce a .traineddata file.

When I run the command "make training MODEL_NAME=testModel_1" I get the following in my console:

@CKVM1:~/Downloads/tesstrain$ make training MODEL_NAME=testModel_1
find: ‘data/testModel_1-ground-truth’: No such file or directory
find: ‘data/testModel_1-ground-truth’: No such file or directory
Error: missing ground truth for training
Makefile:175: recipe for target 'data/testModel_1/list.train' failed
make: *** [data/testModel_1/list.train] Error 1

I believe the issue is that the instructions in the post I linked say to use the "START_MODEL" parameter, which, as far as I understand, uses whichever language you set as a starting point to reduce training time. Since I'm using custom symbols and not actual letters, I don't see how that would benefit me. However, the issue seems to be that it expects a (more general?) ground truth file to already be present before training starts, and I am unsure how to go about solving that.

Any ideas on how to resolve this?

Upvotes: 0

Views: 854

Answers (1)

Eric Ihli

Reputation: 1907

Make sure that your training data is in `tesstrain/data/testModel_1-ground-truth`.
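As a rough sketch of what that layout could look like, the snippet below creates the folder and drops in one placeholder transcription. `TESSTRAIN_DIR`, the file name `sample_0001.gt.txt` and the label `resistor` are made up for illustration; the only parts taken from this thread are the folder name and the `*.gt.txt` pattern that the Makefile searches for.

```shell
# Sketch only. TESSTRAIN_DIR is a hypothetical variable: point it at your
# tesstrain checkout (e.g. ~/Downloads/tesstrain); it defaults to a scratch dir.
TESSTRAIN_DIR="${TESSTRAIN_DIR:-$(mktemp -d)}"
MODEL_NAME=testModel_1

# Mirrors the Makefile defaults: data/<MODEL_NAME>-ground-truth
GROUND_TRUTH_DIR="$TESSTRAIN_DIR/data/${MODEL_NAME}-ground-truth"
mkdir -p "$GROUND_TRUTH_DIR"

# Hypothetical sample transcription; the matching line image would sit
# next to it with the same base name.
printf 'resistor\n' > "$GROUND_TRUTH_DIR/sample_0001.gt.txt"

# The same search the Makefile runs to collect ground truth
find "$GROUND_TRUTH_DIR" -name '*.gt.txt'
```

If `find` prints nothing here, `make training` will not see any ground truth either.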

You can look at what `make training` is doing at https://github.com/tesseract-ocr/tesstrain/blob/0d972f86f4aaf88fde77e3445ff607e68866c882/Makefile#L200

You'll see that it's looking for something in the `GROUND_TRUTH_DIR`.

$(ALL_GT): $(shell find $(GROUND_TRUTH_DIR) -name '*.gt.txt')
    @mkdir -p $(OUTPUT_DIR)
    find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs paste -s > "$@"

`GROUND_TRUTH_DIR` is, by default, defined as `GROUND_TRUTH_DIR := $(OUTPUT_DIR)-ground-truth`.

And if we keep following the chain of Makefile variables back...

# Name of the model to be built. Default: $(MODEL_NAME)
MODEL_NAME = foo

# Data directory for output files, proto model, start model, etc. Default: $(DATA_DIR)
DATA_DIR = data

# Output directory for generated files. Default: $(OUTPUT_DIR)
OUTPUT_DIR = $(DATA_DIR)/$(MODEL_NAME)

Given the output of your error message, it doesn't look like any of the variables other than `MODEL_NAME` have been changed from their defaults, which is good. The training program is simply complaining that you don't have a folder at `tesstrain/data/testModel_1-ground-truth`, which is required. Once that folder exists and contains your ground truth files, this error should go away.
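A small pre-flight check along these lines can confirm the folder before you run `make`. This is only a sketch: `check_ground_truth` is a hypothetical helper, not part of tesstrain, and it only tests for the directory and the `*.gt.txt` files that the Makefile rule above looks for.

```shell
# Hypothetical helper (not part of tesstrain): succeeds only if the directory
# exists and contains at least one *.gt.txt transcription file.
check_ground_truth() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    # tr strips the padding some wc implementations add around the count
    count=$(find "$dir" -name '*.gt.txt' | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]; then
        echo "no .gt.txt files in $dir"
        return 1
    fi
    echo "found $count transcription file(s) in $dir"
}

# Example call with the default path from this thread, relative to the
# tesstrain checkout; '|| true' keeps the sketch from aborting a script.
check_ground_truth "data/testModel_1-ground-truth" || true
```

From the tesstrain directory you could then chain it, for example `check_ground_truth data/testModel_1-ground-truth && make training MODEL_NAME=testModel_1`.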

Upvotes: 0
