Praveenks

Reputation: 1496

Extracting text from OCR image file

I am trying to extract a few fields from an image using OCR. I am using pytesseract to read the image file, and this part works as expected.

Code :

import pytesseract
from PIL import Image
import re

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(text)

Output :

ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge

Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC

184800230, x

Next, I have to extract A0427 and A0425 from the text. The problem is that my loop does not go through the text line by line: it takes one character at a time, which is why my regular expression isn't working.

Code:

for line in text :
    print(line)
    x= re.findall(r'^A[0-9][0-9][0-9][0-9]', text)
    print(x)

Upvotes: 2

Views: 2859

Answers (3)

Pushpesh Kumar Rajwanshi

Reputation: 18357

The problem in your regex is the start anchor ^, which requires the match (for example A0425) to begin at the very start of the string (or of a line, when re.MULTILINE is set). That is not the case here, because each code is preceded by Y and a space. Remove ^ from your regex and you should get all the expected strings. You can also replace the four repetitions of [0-9] with [0-9]{4}, so the shortened regex becomes:

A[0-9]{4}


Modify your current code like this:

import pytesseract
from PIL import Image
import re

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)

print(re.findall(r'A[0-9]{4}', text))

This should print all your matches without needing to loop over the lines individually:

['A0427', 'A0425', 'A0398']
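The same pattern can be checked without an image or a Tesseract install. The sketch below hard-codes the OCR output from the question as a string. The \b word boundaries are an addition that is not part of the answer above; they are assumed to be useful here so that a code embedded in a longer token is not partially matched.

```python
import re

# OCR output copied from the question, hard-coded so that neither the
# image file nor Tesseract is needed to try the pattern.
text = """ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge

Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC

184800230, x
"""

# \b on both sides is an assumption added for this sketch: it rejects
# longer tokens (e.g. "XA04271") that the bare pattern would partly match.
codes = re.findall(r'\bA[0-9]{4}\b', text)
print(codes)
```

If only some of the codes are wanted, the resulting list can be filtered afterwards, for example by membership in a set of expected codes.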

Upvotes: 0

gerwin

Reputation: 909

text is a string, and Python's default behavior when looping over a string with a for loop is to iterate over its characters (a string is a sequence of characters).

To loop through the lines, first split the text into lines using text.splitlines():

for line in text.splitlines():
    print(line)
    # search the current line, not the whole text; ^ is dropped because each code follows "Y "
    x = re.findall(r'A[0-9][0-9][0-9][0-9]', line)
    print(x)

EDIT: Or use Patel's answer to skip the loop altogether :)
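A small sketch of the difference between the two kinds of iteration, using a shortened, hard-coded copy of the OCR output. The pattern in this example has no ^ anchor and is applied to each line in turn; both are choices made for the illustration.

```python
import re

# Shortened copy of the OCR output from the question.
text = "ALS 1 Emergency Base Rate\nY A0427 RE ABC\nAnbulance Mileage Charge\n"

# Iterating a str yields one-character strings.
first_items = list(text)[:3]
print(first_items)

# splitlines() yields whole lines, so the pattern can be applied per line.
matches_per_line = {}
for line in text.splitlines():
    matches_per_line[line] = re.findall(r'A[0-9]{4}', line)
print(matches_per_line)
```

Looping per line is mainly useful when the line itself is needed alongside the match, for instance to keep the description that precedes each code.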

Upvotes: 1

Patel

Reputation: 200

Get rid of that for loop as well; use only

x = re.findall(r'A[0-9][0-9][0-9][0-9]', text)

without any loop (and remove the ^ too).
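To see why the ^ has to go, here is a minimal comparison on a shortened, hard-coded copy of the OCR output. By default ^ matches only at the start of the whole string; with re.MULTILINE it also matches at the start of each line, but the codes still follow "Y ", so neither anchored variant finds anything.

```python
import re

# Shortened copy of the OCR output from the question.
text = "ALS 1 Emergency Base Rate\nY A0427 RE ABC\nY A0425 RE ABC\n"

# ^ without flags: anchored to the start of the whole string only.
anchored = re.findall(r'^A[0-9]{4}', text)

# ^ with re.MULTILINE: anchored to the start of every line, but each
# code line starts with "Y ", so there is still no match.
anchored_multiline = re.findall(r'^A[0-9]{4}', text, flags=re.MULTILINE)

# No anchor: the pattern may match anywhere in the text.
unanchored = re.findall(r'A[0-9]{4}', text)

print(anchored, anchored_multiline, unanchored)
```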

Upvotes: 2
