Deep Learning solution for digit recognition on natural scene

Question

I am working on a problem, where I want to automatically read the number on images as follows:

As can be seen, the images are quite challenging! Not only are these not connected lines in all cases, but also the contrast differs a lot. My first attempt was using pytesseract after some preprocessing. I also created a StackOverflow post here.

While this approach works fine on an individual image, it is not universal, as it requires too much manual information for the preprocessing. The best solution I have so far, is to iterate over some hyperparameters such as threshold value, filter size of erosion/dilation, etc. However, this is computationally expensive!

Therefore I came to believe, that the solution I am looking for must be deep-learning based. I have two ideas here:

Using a pre-trained network on a similar task
Splitting the input images into separate digits and train / finetune a network myself in an MNIST fashion

Regarding the first approach, I have not found something good yet. Does anyone have an idea for that?

Regarding the second approach, I would need a method first to automatically generate images of the separate digits. I guess this should also be deep-learning-based. Afterward, I could maybe achieve some good results with some data augmentation.

Does anyone have ideas? :)

yakhyo · Accepted Answer

Regarding to your first approach,

There are two synthetically prepared datasets available:

Text Recognition Data consists from 9M images.
SynthText in the Wild consists from 8M images.

I have used above datasets for text recognition on slab images. Images were quite challenging however now I achieved more than 90% accuracy for that. I have implemented following models to solve this task. These are:

CRAFT for text localization.
Deep Text Recognition for text recognition.

If you are working with kinds of images only, I highly encourage you to try Deep Text Recognition. It is 4 stage framework.

For Transformation, you can choose TPS or None. With TPS, it has showed higher performance. They implemented Spatial Transformer Networks.
On Feature Extraction stage, you will have options: ResNet or VGG
For Sequential Stage, BiLSTM
Attn or CTC for prediction stage.

They achieved best accuracy on TPS-ResNet-BiLSTM-Attn version. You can easily fine tune this network and I hope it can solve your task. The model trained with above mentioned datasets.

Deep Learning solution for digit recognition on natural scene

Answers (2)

Related Questions