Reputation: 143
I would like to use PyTesseract and OpenCV to read (hundreds) of pages of information like the following into JSON or CSV. How can I let tesseract know about the solid line in the middle dividing the two columns of information? Furthermore, some rows of data are 2 lines instead of 1. What's the best way to account for that?
I'm fairly new at using tesseract and any help would be appreciated!
This is what I have now:
# OCR
txt = pytesseract.image_to_string(thr, config="--psm 11")
# Add ocr to the corresponding part
txt = txt.split("\n")
row = 0
col = 0
for txt1 in txt:
    # Skip over OCR strings that are just spaces or ''
    if txt1.isspace() or txt1 == '':
        continue
    # Hard code detection in...let's just place it into the last column for now
    # Theoretically, the state ("Alaska" in this case) will be in column 0 in the same row
    if re.match(r"\d*\sOpen\sRestaurants", txt1):
        # Assignment (the original used ==, which only compares and has no effect)
        col = 3
    worksheet.write(row//4, col%4, txt1)
    col += 1
    row += 1
workbook.close()
Everything above this code chunk is identical.
However, there are still a lot of misalignments, especially when some addresses or names take more than one line. Additionally, why is the text on the first line read in a different order compared to the rest of the rows?
I was thinking that perhaps I could enforce that every fourth txt is in alphabetical order and use that to detect misalignment? But if even the first row is incorrect, I'm not sure how much I want to hard code corrections. Additionally, sometimes the multiple line entries arise from the address column while other times it arises from the name column (e.g. 258 Interstate Commercial Park Loop on the left-hand side of the page).
Here are some screenshots of the mix-ups on the left:
And on the right:
Upvotes: 0
Views: 2395
Reputation: 8005
You have multiple choices: xlsxwriter, pandas, etc. For instance, you can look at the tutorial for xlsxwriter.
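If a plain CSV file is enough, the standard-library csv module is another option. Below is a minimal sketch, assuming the OCR output has already been grouped into rows. The helper name `rows_to_csv` and the sample row are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, handle):
    """Write an iterable of row sequences to an open text handle as CSV."""
    writer = csv.writer(handle)
    for row in rows:
        writer.writerow(row)


# Hypothetical sample data; real rows would come from the OCR step.
sample_rows = [["Alaska", "Some Name", "258 Interstate Commercial Park Loop"]]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
rows_to_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, open the file with `open("result.csv", "w", newline="")` and pass that handle to the same function.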
You can't tell Tesseract about the dividing line. Instead, you need to manually divide the image by width into two parts (for instance: first-part, second-part) and run OCR on each part separately.
How do you manually divide the image by width?
First get the size of the image, then set the indexes.
# Get the size
(h, w) = img.shape[:2]
# First part
first_part = img[0:h, 0:int(w/2)]
# Second part
second_part = img[0:h, int(w/2):w]
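The same index arithmetic can be wrapped in a small helper, which also makes it easy to try more than two strips. This is only a sketch: `split_bounds` is a hypothetical name, and it assumes the columns have equal width, as in the slicing above. The last strip takes any leftover pixel column when the width is not evenly divisible.

```python
def split_bounds(width, parts=2):
    """Return (start, end) column indexes for `parts` equal vertical strips.

    The last strip absorbs any remainder so no pixel column is dropped.
    """
    step = width // parts
    bounds = []
    for i in range(parts):
        start = i * step
        end = width if i == parts - 1 else start + step
        bounds.append((start, end))
    return bounds


# With an image array `img` of height h and width w, each strip would be
# img[0:h, start:end] for (start, end) in split_bounds(w).
example = split_bounds(101)
```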
Tesseract will account for rows that span two lines, but you need to know the following:
The input image contains no artifacts, so at first glance image preprocessing seems unnecessary. Still, you can apply binarization to make sure you get the best accuracy.
Images are rescaled to fit the size. As we can see, we can get the output by treating each image as a single uniform block of text (--psm 6).
First, you need to store the OCR results in lists.
if i == 0:
    for sentence in txt:
        part1.append(sentence)
else:
    for sentence in txt:
        part2.append(sentence)
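Blank lines in the OCR output would shift every later row once the two lists are paired, so it may help to drop them before appending. Here is a small sketch. `clean_lines` is a hypothetical helper, not part of pytesseract, and the sample string is made up.

```python
def clean_lines(text):
    """Split OCR text into lines, dropping empty and whitespace-only ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up OCR output with blank lines between entries.
raw = "Alaska\n\n  \n12 Open Restaurants\n"
lines = clean_lines(raw)
```

The result could then be passed to `part1.extend(...)` or `part2.extend(...)` in place of the raw split.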
Second, you need to pair the list tuples.
for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1
The zip function lets us take one item from each column's list on every iteration. We then write the values to the corresponding columns.
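One thing to keep in mind is that zip stops at the shorter list, so trailing lines from the longer column are silently dropped. The sketch below uses itertools.zip_longest from the standard library, which pads the shorter side instead. The sample lists are made up.

```python
from itertools import zip_longest

part1 = ["left 1", "left 2", "left 3"]
part2 = ["right 1", "right 2"]

# zip stops at the shorter list and loses "left 3".
short_pairs = list(zip(part1, part2))

# zip_longest keeps every line and fills the gap with an empty string.
full_pairs = list(zip_longest(part1, part2, fillvalue=""))
```

The empty-string fill value keeps the worksheet cell blank rather than writing a placeholder.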
Some data in the Excel file may not be accurate. If that's the case, try different preprocessing methods and different page segmentation modes on the image.
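Comparing page segmentation modes can be done in a loop. In the sketch below, `ocr_with_modes` is a hypothetical helper that takes the OCR function as a parameter, so it runs with pytesseract.image_to_string or with a stand-in. The default modes are only a starting guess: 4 assumes a single column of text, 6 a single uniform block, and 11 looks for sparse text in no particular order, which may explain lines coming back out of sequence.

```python
def ocr_with_modes(image, ocr, modes=(4, 6, 11)):
    """Run `ocr` once per page-segmentation mode and collect the results.

    `ocr` is any callable accepting (image, config=...), for example
    pytesseract.image_to_string.
    """
    results = {}
    for mode in modes:
        results[mode] = ocr(image, config=f"--psm {mode}")
    return results


# Usage with pytesseract (assumes `thr` is the thresholded image):
#     import pytesseract
#     results = ocr_with_modes(thr, pytesseract.image_to_string)
#     for mode, text in results.items():
#         print(mode, len(text.splitlines()))
```

Comparing the number of lines each mode returns is a quick way to spot a mode that merges or splits rows.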
Code:
# Load the libraries
import cv2
import pytesseract
import xlsxwriter
# Load the image in BGR format
img = cv2.imread("WFJO2.jpg")
# Initialize the workbook
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()
row = 0
col = 0
part1 = []
part2 = []
# Get the size
(h, w) = img.shape[:2]
# Initialize indexes
increase = int(w / 2)
start = 0
end = start + increase
# For each part
for i in range(0, 2):
    # Get the current part
    cropped = img[0:h, start:end]
    # Convert to the gray-scale
    gry = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)
    # Threshold
    thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    # OCR
    txt = pytesseract.image_to_string(thr, config="--psm 6")
    # Add ocr to the corresponding part
    txt = txt.split("\n")
    if i == 0:
        for sentence in txt:
            part1.append(sentence)
    else:
        for sentence in txt:
            part2.append(sentence)
    # Set indexes
    start = end
    end = start + increase
for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1
workbook.close()
Upvotes: 3