Kirk

Reputation: 89

Improve pre-processing for OCR/Image Recognition

I've recently developed a strong interest in image processing and optical character recognition. After some basic recognition and a few filters, I decided to try something more difficult.

I'm trying to read the value out of these captchas: http://img851.imageshack.us/img851/9579/57859946.png

I have written some filters for pre-processing, which output an image like this: http://img232.imageshack.us/img232/2325/00i3q45j1zt.png

As you can see, there are holes in some letters. I first thought it might be better to leave the lines running through the letters, but that made it worse. I'm using the Tesseract OCR engine, and I trained it on the Elephant font (the font the captcha uses). I also tried other OCR engines such as GOCR, but they made everything worse. With Tesseract I now have a recognition rate of 20%. I'm coding in C# (.NET 4.0).

The captcha is generated by a software package named PHPCaptcha.

Now my question is: is there any algorithm or trick to fill the holes in the letters? And is there any other way to get better recognition?

I'm excited to hear from you guys

Greetings

Upvotes: 2

Views: 6990

Answers (1)

Gary Tsui

Reputation: 1755


Part 0 - Preface


i) Beforehand, you may want to read my OCR-related answer here, which may give you some tricks for using Tesseract.

ii) I assume you could just turn everything into black and white (in your case, processing in color doesn't give you an edge).
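A minimal sketch of the black-and-white step in point ii), in plain Python rather than the asker's C#. The function names, the nested-list image layout and the threshold of 128 are assumptions made for illustration; a real captcha will likely need a tuned threshold, and this sketch assumes the colored lines have already been removed.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    # Integer approximation of the common 0.299/0.587/0.114 luma weights.
    return [
        [(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=128):
    """Return rows of 1 (ink: darker than threshold) and 0 (background)."""
    # The threshold is a hypothetical default, not a value from the question.
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


if __name__ == "__main__":
    pixels = [
        [(0, 0, 0), (255, 255, 255)],
        [(200, 200, 200), (10, 10, 10)],
    ]
    for row in binarize(to_grayscale(pixels)):
        print(row)
```

Dark pixels become 1 and light pixels become 0, which is the representation the later steps work on.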


Part 1 - Preprocessing


To fill the holes left after you've removed the blue lines, you can dilate, or perform a 'dilate-then-erode' operation (also known as morphological closing). Here, dilation means you enlarge every pixel in 8 directions (making a bigger pixel). Once you've dilated the pixels, check whether the characters can be recognized or whether they are 'over-filled' (dilated too much). If they cannot be recognized or have been dilated too much, you can then apply an erosion operation. Of course there are more advanced synthesis algorithms, but I think you are better off starting with a simple image processing operation first.
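The dilate-then-erode idea can be sketched as follows, in plain Python on a binary image stored as rows of 0/1 values. The function names and the choice to ignore out-of-bounds neighbours are assumptions made for this sketch; in practice an image library's built-in morphology routines would normally be used instead.

```python
def _neighbourhood(img, y, x):
    """Yield the pixel at (y, x) and its 8 neighbours, skipping out-of-bounds cells."""
    height, width = len(img), len(img[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                yield img[ny][nx]


def dilate(img):
    """A pixel becomes ink if it or any of its 8 neighbours is ink."""
    return [
        [1 if any(_neighbourhood(img, y, x)) else 0 for x in range(len(img[0]))]
        for y in range(len(img))
    ]


def erode(img):
    """A pixel stays ink only if it and all in-bounds neighbours are ink."""
    return [
        [1 if all(_neighbourhood(img, y, x)) else 0 for x in range(len(img[0]))]
        for y in range(len(img))
    ]


def close_holes(img, iterations=1):
    """Dilate `iterations` times, then erode the same number of times."""
    for _ in range(iterations):
        img = dilate(img)
    for _ in range(iterations):
        img = erode(img)
    return img


if __name__ == "__main__":
    # A 5x5 block of ink with a one-pixel hole, inside a 9x9 image.
    image = [[1 if 2 <= y <= 6 and 2 <= x <= 6 else 0 for x in range(9)]
             for y in range(9)]
    image[4][4] = 0
    for row in close_holes(image):
        print("".join("#" if p else "." for p in row))
```

A single iteration fills one-pixel holes and gaps up to two pixels wide; larger holes need more iterations, at the risk of merging neighbouring strokes, which is the 'over-filled' case described above.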


Part 2 - OCR/Tesseract


If you feed the whole image into Tesseract, it performs line analysis and other layout steps. Since characters in a captcha don't behave like normal text, doing line analysis or recognizing them as a group may deteriorate the recognition rate. So my suggestion is to recognize the text character by character first.
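One simple way to get per-character images is to cut the binarized image at columns that contain no ink. The sketch below is an illustration only: `split_characters` and `min_width` are hypothetical names, and the approach assumes the characters are separated by at least one empty column, which distorted or overlapping captcha glyphs may not satisfy.

```python
def split_characters(img, min_width=1):
    """Split a binary image (rows of 0/1) into crops at fully empty columns."""
    height, width = len(img), len(img[0])
    # True for every column that contains at least one ink pixel.
    has_ink = [any(img[y][x] for y in range(height)) for x in range(width)]

    crops = []
    start = None  # left edge of the run of inked columns currently being scanned
    for x in range(width + 1):  # one extra step to flush a run touching the right edge
        ink = x < width and has_ink[x]
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            if x - start >= min_width:  # drop runs narrower than min_width (specks)
                crops.append([row[start:x] for row in img])
            start = None
    return crops


if __name__ == "__main__":
    image = [
        [0, 1, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 1, 0],
    ]
    for crop in split_characters(image):
        print(crop)
```

Each crop can then be passed to Tesseract on its own. Tesseract's page segmentation mode 10 treats the input as a single character, which avoids the line analysis described above.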

Upvotes: 2
