theAlse

Reputation: 5757

OCR library to read text from images (preferably Python)

I need to read the text from some images. The images are clear and have very little noise, so my original thought was that it should be pretty easy to extract the text (little did I know).

I tested some Python libraries (pytesser) without much success; they would get maybe 10% right. I turned to Google's tesseract-ocr, but it is still far from good.

Here is one example: [image]

and below is the result:

nemnamons

Ill
w_on

lhggerllo
' 59
' as

\M_P2ma\

vuu uu

Cafllode omer
Mom | Dyna
Mom | Dyna

lnggerllo



2vMnne= Tr2rspnn| Factory (Hexmy;

lalgeflll Uxzlconflg
w_o«
w_o«

cammem

What am I doing wrong? Or is OCR really this bad?

Upvotes: 2

Views: 6322

Answers (1)

Snow

Reputation: 1138

You will need to pre-process the image, for example by removing noise, to get a better result. You can then use a library such as pytesseract to extract the text from the image:

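import cv2
import numpy as np
import pytesseract
from PIL import Image

def get_string(img_path):
    # Load the image and convert it to grayscale
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove some noise (dilate, then erode).
    # Note: a 1x1 kernel leaves the image unchanged; a larger kernel
    # such as (2, 2) is needed for the filter to have any effect.
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    cv2.imwrite("removed_noise.png", img)

    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open("removed_noise.png"))

    return result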
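
Binarization is another pre-processing step that is often tried before OCR. It is not part of the code above; the sketch below is only an illustration, assuming an 8-bit grayscale image given as rows of integers in 0..255. The helper names `otsu_threshold` and `binarize` are hypothetical, and the threshold is chosen with Otsu's method, written in plain Python so it runs without OpenCV:

```python
def otsu_threshold(pixels):
    """Return a threshold t (0..255) chosen by Otsu's method.

    `pixels` is a flat list of 8-bit grayscale values. Values <= t are
    treated as one class and values > t as the other; t maximizes the
    between-class variance.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_low = 0      # number of pixels with value <= t
    sum_low = 0    # sum of those pixel values
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break             # all pixels are in the lower class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance, up to a constant factor
        var = w_low * w_high * (mean_low - mean_high) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(rows, threshold):
    """Map every pixel above `threshold` to 255 and the rest to 0."""
    return [[255 if p > threshold else 0 for p in row] for row in rows]


if __name__ == "__main__":
    # Tiny synthetic "image": dark values around 40, light values around 200
    image = [
        [40, 45, 200, 205],
        [42, 38, 198, 210],
    ]
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    print(t, binarize(image, t))
```

With OpenCV, the same idea is usually expressed as `cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)` on the grayscale image. Whether it helps for a particular image has to be checked by experiment.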

Upvotes: 1
