Danlo9

Reputation: 143

Tesseract Function to Split Into 2 Columns

I would like to use PyTesseract and OpenCV to read (hundreds) of pages of information like the following into JSON or CSV. How can I let tesseract know about the solid line in the middle dividing the two columns of information? Furthermore, some rows of data are 2 lines instead of 1. What's the best way to account for that?

I'm fairly new at using tesseract and any help would be appreciated!

[scanned page: two columns of data separated by a solid vertical line]


Edit!!

This is what I have now:

# OCR
txt = pytesseract.image_to_string(thr, config="--psm 11")

# Split the OCR output into lines
txt = txt.split("\n")

row = 0
col = 0

for txt1 in txt:

    # Skip over OCR strings that are just spaces or ''
    if txt1.isspace() or txt1 == '':
        continue

    # Hard-coded detection; for now, just place it into the last column
    # Theoretically, the state ("Alaska" in this case) will be in column 0 of the same row
    if re.match(r"\d*\sOpen\sRestaurants", txt1):
        col = 3
        
    worksheet.write(row//4, col%4, txt1)
    col += 1
    row += 1

workbook.close()

Everything above this code chunk is identical.

However, there are still a lot of misalignments, especially when some addresses or names take more than one line. Additionally, why is the text on the first line read in a different order than the rest of the rows?

I was thinking that perhaps I could enforce that every fourth txt entry is in alphabetical order and use that to detect misalignment. But if even the first row is incorrect, I'm not sure how much I want to hard-code corrections. Additionally, sometimes the multi-line entries arise from the address column, while other times they arise from the name column (e.g. 258 Interstate Commercial Park Loop on the left-hand side of the page).
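For reference, here is a rough sketch of the alphabetical-order check I had in mind (the cells list is hypothetical and just stands in for the OCR strings in row-major order, four per row):

# Hypothetical flat list of OCR cells, four per row: city, address, name, phone
cells = ["Northport", "5550 McFarland Blvd", "Sharmishta Patel", "(205) 200-7822",
         "Odenville", "130 Council Drive", "Gratton Curbow", "(205) 629-7827"]

# The city column (every fourth entry) should read in alphabetical order down the page;
# a break in that ordering suggests a misaligned row
cities = cells[0::4]
for i in range(1, len(cities)):
    if cities[i].lower() < cities[i - 1].lower():
        print(f"Possible misalignment near row {i}: {cities[i - 1]!r} -> {cities[i]!r}")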

Here are some screenshots of the mix-ups on the left:

[screenshot: mix-ups on the left side of the page]

And on the right:

[screenshot: mix-ups on the right side of the page]

Upvotes: 0

Views: 2395

Answers (1)

Ahx

Reputation: 8005

  • I would like to use PyTesseract and OpenCV to read (hundreds) of pages of information like the following into JSON or CSV.

You have multiple choices: xlsxwriter, pandas, etc. For instance, you can look at the xlsxwriter tutorial.
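If the end goal is CSV or JSON rather than Excel, a minimal pandas sketch could look like the following (the rows list and column names are placeholders, assuming the OCR output has already been split into city/address/name/phone fields):

import pandas as pd

# Placeholder rows: one list per record (city, address, name, phone)
rows = [
    ["CityA", "123 Main St", "Jane Doe", "(555) 000-0000"],
    ["CityB", "456 Oak Ave", "John Roe", "(555) 111-1111"],
]

df = pd.DataFrame(rows, columns=["City", "Address", "Name", "Phone"])
df.to_csv("result.csv", index=False)          # CSV output
df.to_json("result.json", orient="records")   # JSON output, one object per row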

  • How can I let tesseract know about the solid line in the middle dividing the two columns of information?

You can't. You need to manually divide the image by width into two parts. For instance: first-part, second-part

How do you manually divide the image by width?

First get the size of the image, then set the indexes.

# Get the size
(h, w) = img.shape[:2]

# First part
first_part = img[0:h, 0:int(w/2)]

# Second part
second_part = img[0:h, int(w/2):w]

  • Furthermore, some rows of data are 2 lines instead of 1. What's the best way to account for that?

Tesseract will account for that, but you need to know the following:


The input image contains no artifacts, so at first glance image preprocessing seems unnecessary. Still, you can apply binarization to make sure you get the best accuracy.
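For example, a minimal binarization sketch using Otsu's threshold (the same approach used in the full code at the end; the file name is a placeholder):

import cv2

# Load the image and convert it to gray-scale (file name is a placeholder)
img = cv2.imread("page.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu's method picks the global threshold automatically
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]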

Part-1 and Part-2 images (the two cropped halves), with the paired OCR output:
Northpor. 13620 Highway 43 North Hardikkumar Patel (205) 339-1188
Northport 1836 McFaland Bld Harikkumar Patel {205} 339-1782
Northport 5550 McFarland Bld Sharmishta Patel (205) 200-7822
Odenvile 130 Council Orive Gratton Curbow (205) 629-7827
Oneonta 511 nd Ave E Govinddhai Patel (205) 625.5847

velxa 1017 Columbus Parkway Luis Cribb (934) 749-3628

alka 2300 Gateway Or Donna Cribb (234) 749-2308
pp 101 Stewart Ave ‘Utpa! Patel {334} 433-7325,
Orange Beach 25755 Perdido Beach Bhd. Patrick Shedd (251) 981-6881
Orange Beach 25814 Canal Rd Patrick Shed (251) 91-4184
Owens Crossroads 6707 Hwy 43) South Richard Hyde (256) 519-2425
Owns Gross Road 330 Sutton Road Richard Hyde (256) 518-2004
. . .
. . .
. . .
Talladega 244 Haynes SI tus Crisp (258} 315-0191
Talladega 608 East Batle Street Luis Cribb (256) 362-0781
Tallassee 454 Gimere Ave Donna Cribb (334) 283-2067
Tanner 5956 Hwy 31 N Mike Nadesi (256) 352-9808
Torani 1806 Pingon Valley Re Sanjayknat Patel (205) 849-0112
Theodore 5827 Hwy SOW Mukeshkumar Soparwala (251) 854-0048,
Theodore 6860 Theodore Dawes Rd Anthony Laf enier (251) 853-2010
Thamasvilie 33202 Hwy 43 Ranjeev Acharya (334) 636-0333
Thomasville 3430S Huy 43 Ranjeey Acharya (334) 636-0830
Tius 80 Tus Road Garret Gray (G34) 514-9930
Town Creek 2795 Hwy 20 Madhav Maina (256) 686-3900
Troy 1003 Highway 231 South Luis Cribb (339) 568-7944
Troy 1420 US 231 South Dehua Patel (334) 670-6390
. . .
. . .
. . .

The images are rescaled to fit. As we can see, we can get the output by treating each part as a single uniform block of text (--psm 6).

How do you write the result?


  • First, you need to store the OCR results in lists.

    • if i == 0:
          for sentence in txt:
              part1.append(sentence)
      else:
          for sentence in txt:
              part2.append(sentence)
      
  • Second, you need to pair up the two lists.

    • for txt1, txt2 in zip(part1, part2):
          worksheet.write(row, col, txt1)
          worksheet.write(row, col + 1, txt2)
          row += 1
      

The zip function lets us take one value from each list on every iteration, i.e. one pair of column strings per row. Then we write the values to the corresponding columns.

Some data in the Excel file may not be accurate. If that's the case, you need to try different preprocessing methods and different page segmentation modes.
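For instance, a rough way to compare a few page segmentation modes side by side (a sketch only; thr is the thresholded image from the code below, and the PSM values listed are just examples):

import pytesseract

# Run OCR with several page-segmentation modes and print the results for comparison
for psm in (4, 6, 11, 12):
    txt = pytesseract.image_to_string(thr, config=f"--psm {psm}")
    print(f"--- psm {psm} ---")
    print(txt)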


Code:

# Load the libraries
import cv2
import pytesseract
import xlsxwriter

# Load the image in BGR format
img = cv2.imread("WFJO2.jpg")

# Initialize the workbook
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()

row = 0
col = 0

part1 = []
part2 = []

# Get the size
(h, w) = img.shape[:2]

# Initialize indexes
increase = int(w / 2)
start = 0
end = start + increase

# For each part
for i in range(0, 2):

    # Get the current part
    cropped = img[0:h, start:end]

    # Convert to the gray-scale
    gry = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)

    # Threshold
    thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # OCR
    txt = pytesseract.image_to_string(thr, config="--psm 6")

    # Add ocr to the corresponding part
    txt = txt.split("\n")

    if i == 0:
        for sentence in txt:
            part1.append(sentence)
    else:
        for sentence in txt:
            part2.append(sentence)

    # Set indexes
    start = end
    end = start + increase

for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1

workbook.close()

Upvotes: 3
