How to properly train a model for OCR with CreateML? Mine is not just bad, it is utterly worthless

Question

I created a dataset with 1 generated training image per letter and about 10 real-life test images per letter.

All of them are 10x14px, black&white (nicely binarized in preprocessing stage).

The resulting model recognizes all symbols as '1' (even the actual images from the testing set), so basically it is not working at all.

Who can point me in the right direction?

Here's the CreateML output -

Extracting image features from full data set.
Analyzing and extracting image features.
+------------------+--------------+------------------+
| Images Processed | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 1                | 1.74s        | 2.5%             |
| 2                | 1.96s        | 5.25%            |
| 3                | 2.17s        | 8%               |
| 4                | 2.39s        | 10.75%           |
| 5                | 2.60s        | 13.5%            |
| 10               | 3.68s        | 27%              |
| 25               | 6.90s        | 67.5%            |
| 37               | 9.48s        | 100%             |
| 36               | 9.26s        | 97.25%           |
+------------------+--------------+------------------+
Skipping automatic creation of validation set; training set has fewer than 50 points.
Beginning model training on processed features. 
Calibrating solver; this may take some time.
+-----------+--------------+-------------------+
| Iteration | Elapsed Time | Training Accuracy |
+-----------+--------------+-------------------+
| 0         | 0.038845     | 0.027027          |
| 1         | 0.139269     | 0.837838          |
| 2         | 0.268821     | 0.945946          |
| 3         | 0.317312     | 0.945946          |
| 4         | 0.367944     | 0.972973          |
| 5         | 0.422657     | 0.972973          |
| 10        | 0.713325     | 1.000000          |
| 24        | 1.495230     | 1.000000          |
+-----------+--------------+-------------------+
SUCCESS: Optimal solution found.
Extracting image features from evaluation data.
Analyzing and extracting image features.
+------------------+--------------+------------------+
| Images Processed | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 1                | 211.661ms    | 0.25%            |
| 2                | 425.538ms    | 0.75%            |
| 3                | 641.33ms     | 1.25%            |
| 4                | 861.215ms    | 1.75%            |
| 5                | 1.07s        | 2.25%            |
| 10               | 2.16s        | 4.75%            |
| 25               | 5.39s        | 12%              |
| 50               | 10.75s       | 24%              |
| 75               | 16.12s       | 36%              |
| 100              | 21.51s       | 48%              |
| 125              | 26.88s       | 60%              |
| 150              | 32.24s       | 72%              |
| 175              | 37.61s       | 84%              |
| 200              | 42.97s       | 96%              |
| 208              | 44.69s       | 100%             |
| 207              | 44.47s       | 99.5%            |
+------------------+--------------+------------------+
Trained model successfully saved at /mypath/ocr.mlmodel.

Matthijs Hollemans · Accepted Answer

OCR is hard.
You don't have nearly enough training examples.
Create ML uses a "feature extractor" that is very powerful. This also means it's really easy to the model to simply memorize all your training data since the images are really small and you only have a couple of them. There is not much you can do with Create ML to prevent this from happening.
Try training a simple logistic regression on your data using a package such as scikit-learn. It will work much better than a deep neural network.

How to properly train a model for OCR with CreateML? Mine is not just bad, it is utterly worthless

Answers (1)

Related Questions