Reputation: 1496
I am trying to extract a few fields from an image using OCR. I am using pytesseract to read the image file, and this is working as expected.
Code:
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(text)
Output:
ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge
Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC
184800230, x
Next, I have to extract A0427 and A0425 from the text, but the problem is that my loop does not go through whole lines. It takes one character at a time, and that's why my regular expression isn't working.
Code:
for line in text :
    print(line)
    x= re.findall(r'^A[0-9][0-9][0-9][0-9]', text)
    print(x)
Upvotes: 2
Views: 2859
Reputation: 18357
The problem in your regex is the start anchor ^, which requires the match (for example A0425) to begin at the very start of the string (or of a line, if re.MULTILINE is set). That is not the case here, as you have Y and a space before it. So just remove ^ from your regex and you should get all the expected strings. Also, you can replace the four repetitions of [0-9] with [0-9]{4}, so your shortened regex becomes:
A[0-9]{4}
You need to modify your current code like this:
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(re.findall(r'A[0-9]{4}', text))
This should print all your matches without needing to loop over the lines individually:
['A0427', 'A0425', 'A0398']
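To see the effect of the anchor in isolation, here is a minimal sketch that runs the patterns against the OCR output pasted into a string, so it needs neither Tesseract nor the image. The name sample_text and the third pattern (which assumes every code line starts with "Y ") are illustrative assumptions, not part of the original code.

```python
import re

# OCR output from the question, pasted in as a plain string (assumption:
# pytesseract returns roughly this text for the image).
sample_text = """ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge
Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC
184800230, x"""

# With ^ and no flags, the pattern can only match at the very start of the
# whole string, which begins with "ALS", so nothing is found.
anchored = re.findall(r'^A[0-9]{4}', sample_text)

# Without the anchor, the pattern may match anywhere in the text.
unanchored = re.findall(r'A[0-9]{4}', sample_text)

# With re.MULTILINE, ^ matches at the start of every line; the group
# captures only the code that follows the leading "Y ".
per_line = re.findall(r'^Y (A[0-9]{4})', sample_text, re.MULTILINE)

print(anchored)
print(unanchored)
print(per_line)
```

If the layout of the lines is stable, the re.MULTILINE variant is a bit stricter, because it only accepts codes that appear right after a leading "Y ".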
Upvotes: 0
Reputation: 909
text is a string, and the default behavior in Python when looping over a string with a for-loop is to iterate over its characters (a string is basically a sequence of characters).
To loop through the lines, first split the text into lines using text.splitlines():
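for line in text.splitlines():
    print(line)
    # Search the current line rather than the whole text, and drop the ^
    # anchor, because the code comes after "Y " on each line.
    x = re.findall(r'A[0-9][0-9][0-9][0-9]', line)
    print(x)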
EDIT: Or use Patel's answer to skip the loop altogether :)
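The difference between the two kinds of iteration can be sketched on a short string without running OCR. The variable sample_text below is a hypothetical stand-in for the text returned by pytesseract.

```python
import re

# Two lines taken from the OCR output in the question.
sample_text = "Y A0427 RE ABC\nY A0425 RE ABC"

# Iterating over a string yields single characters.
first_chars = [ch for ch in sample_text][:3]

# splitlines() yields whole lines, without the trailing newline.
lines = sample_text.splitlines()

# Searching each line separately gives one list of matches per line.
codes_per_line = [re.findall(r'A[0-9]{4}', line) for line in lines]

print(first_chars)
print(lines)
print(codes_per_line)
```

Looping per line is mainly useful when you need to know which line a code came from; otherwise a single findall over the whole text is simpler.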
Upvotes: 1
Reputation: 200
Get rid of the for loop as well and use only
x = re.findall(r'A[0-9][0-9][0-9][0-9]', text)
without any loop (and remove the ^ too).
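One optional tightening, offered as a sketch rather than something the question requires: an unanchored pattern also matches inside longer tokens, which noisy OCR output can produce. Adding \b word boundaries restricts matches to standalone codes. The second line of sample_text below is invented for illustration and does not come from the original image.

```python
import re

# First line is from the question; the second is a made-up example of a
# longer token that merely contains something code-like.
sample_text = "Y A0427 RE ABC\nREF BA04251 x"

# The plain pattern also picks up "A0425" from inside "BA04251".
loose = re.findall(r'A[0-9]{4}', sample_text)

# \b requires a word boundary on both sides, so only the standalone
# code is returned.
strict = re.findall(r'\bA[0-9]{4}\b', sample_text)

print(loose)
print(strict)
```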
Upvotes: 2