Danlo9

Reputation: 143

Tesseract Function to Split Into 2 Columns

I would like to use PyTesseract and OpenCV to read (hundreds) of pages of information like the following into JSON or CSV. How can I let tesseract know about the solid line in the middle dividing the two columns of information? Furthermore, some rows of data are 2 lines instead of 1. What's the best way to account for that?

I'm fairly new at using tesseract and any help would be appreciated!

[scanned page: two columns of data separated by a solid vertical line]


Edit!!

This is what I have now:

# OCR
txt = pytesseract.image_to_string(thr, config="--psm 11")

# Split the OCR output into lines
txt = txt.split("\n")

row = 0
col = 0

for txt1 in txt:

    # Skip over OCR strings that are just spaces or ''
    if txt1.isspace() or txt1 == '':
        continue

    # Hard-coded detection; for now, just place it into the last column
    # Theoretically, the state ("Alaska" in this case) will be in column 0 of the same row
    if re.match(r"\d*\sOpen\sRestaurants", txt1):
        col = 3
        
    worksheet.write(row//4, col%4, txt1)
    col += 1
    row += 1

workbook.close()

Everything above this code chunk is identical.

However, there are still a lot of misalignments, especially when some addresses or names take more than one line. Additionally, why is the text on the first line read in a different order than the rest of the rows?

I was thinking that perhaps I could enforce that every fourth txt entry is in alphabetical order and use that to detect misalignment. But if even the first row is incorrect, I'm not sure how much I want to hard-code corrections. Additionally, sometimes the multi-line entries arise from the address column, while other times they arise from the name column (e.g. 258 Interstate Commercial Park Loop on the left-hand side of the page).
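For reference, here is a rough sketch of the alphabetical-order check I had in mind (the cells list is hypothetical and just stands in for the OCR strings in row-major order, four per row):

# Hypothetical flat list of OCR cells, four per row: city, address, name, phone
cells = ["Northport", "5550 McFarland Blvd", "Sharmishta Patel", "(205) 200-7822",
         "Odenville", "130 Council Drive", "Gratton Curbow", "(205) 629-7827"]

# The city column (every fourth entry) should read in alphabetical order down the page;
# a break in that ordering suggests a misaligned row
cities = cells[0::4]
for i in range(1, len(cities)):
    if cities[i].lower() < cities[i - 1].lower():
        print(f"Possible misalignment near row {i}: {cities[i - 1]!r} -> {cities[i]!r}")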

Here are some screenshots of the mix-ups on the left:

[screenshot: mix-ups on the left side of the page]

And on the right:

[screenshot: mix-ups on the right side of the page]

Upvotes: 0

Views: 2395

Answers (1)

Ahx

Reputation: 8005

  • I would like to use PyTesseract and OpenCV to read (hundreds) of pages of information like the following into JSON or CSV.

You have multiple choices: xlsxwriter, pandas, etc. For instance, you can look at the xlsxwriter tutorial.
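If the end goal is CSV or JSON rather than Excel, a minimal pandas sketch could look like the following (the rows list and column names are placeholders, assuming the OCR output has already been split into city/address/name/phone fields):

import pandas as pd

# Placeholder rows: one list per record (city, address, name, phone)
rows = [
    ["CityA", "123 Main St", "Jane Doe", "(555) 000-0000"],
    ["CityB", "456 Oak Ave", "John Roe", "(555) 111-1111"],
]

df = pd.DataFrame(rows, columns=["City", "Address", "Name", "Phone"])
df.to_csv("result.csv", index=False)          # CSV output
df.to_json("result.json", orient="records")   # JSON output, one object per row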

  • How can I let tesseract know about the solid line in the middle dividing the two columns of information?

You can't. You need to manually divide the image by width into two parts. For instance: first-part, second-part

How do you manually divide the image by width?

First get the size of the image, then set the indexes.

# Get the size
(h, w) = img.shape[:2]

# First part
first_part = img[0:h, 0:int(w/2)]

# Second part
second_part = img[0:h, int(w/2):w]

  • Furthermore, some rows of data are 2 lines instead of 1. What's the best way to account for that?

Tesseract will account for that, but you need to know the following:


The input image contains no artifacts, so at first glance image preprocessing seems unnecessary. Still, you can apply binarization to make sure you get the best accuracy.
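For example, a minimal binarization sketch using Otsu's threshold (the same approach used in the full code at the end; the file name is a placeholder):

import cv2

# Load the image and convert it to gray-scale (file name is a placeholder)
img = cv2.imread("page.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu's method picks the global threshold automatically
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]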

Part-1 and Part-2 images (the two cropped halves), with the paired OCR output:
Northpor. 13620 Highway 43 North Hardikkumar Patel (205) 339-1188
Northport 1836 McFaland Bld Harikkumar Patel {205} 339-1782
Northport 5550 McFarland Bld Sharmishta Patel (205) 200-7822
Odenvile 130 Council Orive Gratton Curbow (205) 629-7827
Oneonta 511 nd Ave E Govinddhai Patel (205) 625.5847

velxa 1017 Columbus Parkway Luis Cribb (934) 749-3628

alka 2300 Gateway Or Donna Cribb (234) 749-2308
pp 101 Stewart Ave ‘Utpa! Patel {334} 433-7325,
Orange Beach 25755 Perdido Beach Bhd. Patrick Shedd (251) 981-6881
Orange Beach 25814 Canal Rd Patrick Shed (251) 91-4184
Owens Crossroads 6707 Hwy 43) South Richard Hyde (256) 519-2425
Owns Gross Road 330 Sutton Road Richard Hyde (256) 518-2004
. . .
. . .
. . .
Talladega 244 Haynes SI tus Crisp (258} 315-0191
Talladega 608 East Batle Street Luis Cribb (256) 362-0781
Tallassee 454 Gimere Ave Donna Cribb (334) 283-2067
Tanner 5956 Hwy 31 N Mike Nadesi (256) 352-9808
Torani 1806 Pingon Valley Re Sanjayknat Patel (205) 849-0112
Theodore 5827 Hwy SOW Mukeshkumar Soparwala (251) 854-0048,
Theodore 6860 Theodore Dawes Rd Anthony Laf enier (251) 853-2010
Thamasvilie 33202 Hwy 43 Ranjeev Acharya (334) 636-0333
Thomasville 3430S Huy 43 Ranjeey Acharya (334) 636-0830
Tius 80 Tus Road Garret Gray (G34) 514-9930
Town Creek 2795 Hwy 20 Madhav Maina (256) 686-3900
Troy 1003 Highway 231 South Luis Cribb (339) 568-7944
Troy 1420 US 231 South Dehua Patel (334) 670-6390
. . .
. . .
. . .

The images are rescaled to fit. As we can see, we can get the output by treating each part as a single uniform block of text (--psm 6).

How do you write the result?


  • First, you need to store the OCR results in lists.

    • if i == 0:
          for sentence in txt:
              part1.append(sentence)
      else:
          for sentence in txt:
              part2.append(sentence)
      
  • Second, you need to pair up the two lists.

    • for txt1, txt2 in zip(part1, part2):
          worksheet.write(row, col, txt1)
          worksheet.write(row, col + 1, txt2)
          row += 1
      

The zip function lets us take one value from each list on every iteration, i.e. one pair of column strings per row. Then we write the values to the corresponding columns.

Some data in the Excel file may not be accurate. If that's the case, you need to try different preprocessing methods and different page segmentation modes.
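For instance, a rough way to compare a few page segmentation modes side by side (a sketch only; thr is the thresholded image from the code below, and the PSM values listed are just examples):

import pytesseract

# Run OCR with several page-segmentation modes and print the results for comparison
for psm in (4, 6, 11, 12):
    txt = pytesseract.image_to_string(thr, config=f"--psm {psm}")
    print(f"--- psm {psm} ---")
    print(txt)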


Code:

# Load the libraries
import cv2
import pytesseract
import xlsxwriter

# Load the image in BGR format
img = cv2.imread("WFJO2.jpg")

# Initialize the workbook
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()

row = 0
col = 0

part1 = []
part2 = []

# Get the size
(h, w) = img.shape[:2]

# Initialize indexes
increase = int(w / 2)
start = 0
end = start + increase

# For each part
for i in range(0, 2):

    # Get the current part
    cropped = img[0:h, start:end]

    # Convert to the gray-scale
    gry = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)

    # Threshold
    thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # OCR
    txt = pytesseract.image_to_string(thr, config="--psm 6")

    # Add ocr to the corresponding part
    txt = txt.split("\n")

    if i == 0:
        for sentence in txt:
            part1.append(sentence)
    else:
        for sentence in txt:
            part2.append(sentence)

    # Set indexes
    start = end
    end = start + increase

for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1

workbook.close()

Upvotes: 3
