Monark Unadkat

Reputation: 190

Extract tabular data from images

We have built a model that detects table regions.

The next step is to parse the detected table image and convert it into a CSV file or DataFrame. We are having trouble with this step; here is what we have tried so far.

We tried the OpenCV reduce method to find the vertical lines (column separators), but it fails when the gaps between words are large (sample shared below). The white boxes in the sample image are the actual locations of the words detected by the OCR system.

The process is as follows:

1. The image is passed to the OCR system, which returns the detected text along with its bounding boxes.
2. We plot the bounding boxes on an image with a black background.
3. We pass the plotted image to the code below twice: first as plotted, to get the coordinates of the horizontal lines; then rotated by 90 degrees, to get the coordinates of the vertical lines.

Plotting lines at these coordinates gives the result below; this is just for visualization. However, the approach fails in cases like this one.

[sample image]

The code is shared below.

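import cv2

img = cv2.imread("plotted_boxes.png")  # hypothetical path to the plotted image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Average each row to get a horizontal projection profile
hist = cv2.reduce(gray, 1, cv2.REDUCE_AVG).reshape(-1)
th = 2
H, W = img.shape[:2]
# Rows where the profile drops from above the threshold to at or below it
lowers = [y for y in range(H - 1) if hist[y] > th and hist[y + 1] <= th]

for y in lowers:
    cv2.line(img, (0, y), (W, y), (0, 255, 0), 1)
cv2.imwrite("demo_img.png", img)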
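For readers who want to reproduce the idea without the original images, here is a minimal NumPy-only sketch of the steps described above. The box coordinates are made up for illustration, `draw_boxes` and `band_ends` are hypothetical helper names, and the column pass transposes the mask rather than rotating it by 90 degrees, which should give the same profile up to ordering.

```python
import numpy as np

def draw_boxes(shape, boxes):
    """Return a black canvas with each (x, y, w, h) box filled in white."""
    canvas = np.zeros(shape, dtype=np.uint8)
    for x, y, w, h in boxes:
        canvas[y:y + h, x:x + w] = 255
    return canvas

def band_ends(mask, th=2):
    """Indices where the row-wise mean drops from > th to <= th."""
    hist = mask.mean(axis=1)
    return [i for i in range(len(hist) - 1) if hist[i] > th and hist[i + 1] <= th]

# Hypothetical OCR word boxes: two text rows, three columns.
boxes = [(10, 10, 40, 12), (80, 10, 30, 12), (150, 10, 35, 12),
         (10, 40, 25, 12), (80, 40, 45, 12), (150, 40, 20, 12)]
mask = draw_boxes((70, 200), boxes)

row_ends = band_ends(mask)    # y coordinates of horizontal separators
col_ends = band_ends(mask.T)  # x coordinates of vertical separators
print(row_ends, col_ends)     # [21, 51] [49, 124, 184]
```

Note that a column separator only appears where some x range is empty in every row, so a single wide cell or a skewed page can merge neighbouring columns.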

For more sample documents

Appreciate the help

Upvotes: 6

Views: 10174

Answers (2)

nathancy

Reputation: 46600

If you're trying to detect text in the image using OCR, it's important to preprocess the image to remove noise, filter out undesired objects, and, in this case, remove the grid lines. Here's a simple approach: obtain a binary image, repair the horizontal grid lines so they can be detected, remove the horizontal and vertical table lines, and then perform OCR using Pytesseract. Here are the results with some of your images.


Before -> After and OCR result

[before and after images]

ASSETS

Checking & Savings ACCOUNT BEGINNING BALANCE — ENDING BALANCE
THIS PERIOD THIS PERIOD

Chase Total Checking 000000629831256 $174.02 $5.28

Chase Savings 000003313056365 25.00 0.72

Total $199.02 $6.00

TOTAL ASSETS $199.02 $6.00

[before and after images]

HIBACHI GRILL ASIAN ELK GROVE VIL IL 10/23 (...4719) Card -$34.00 $1,531.31
Oct 23,2018 SAMSCLUB #6464 DES PLAINES IL 10/23 (...4719) Card -$26.07 $1,565.31
Oct 15,2018 SAMS CLUB SAM'S Club DES PLAINES IL 10/14 (...4719) Card -$36.07 $1,591.38
Premier *Bankcard LLC 605-3573440 SD 10/14 (...4719) | Card -$70.00 $1,627.45
CANOPY-BUFFETT DES PLAINES IL 10/14 (...4719) Card -$33.24 $1,697.45
COMCAST CHICAGO CS 1X 800-266-2278 IL 10/14 (...4719) Card -$275.45 $1,730.69
ATM CHECK DEPOSIT 10/13 1590 LEE ST DES PLAINES IL ATM deposit $803.92 $2,006.14
Oct 12,2018 VILLAGE OF ROSEM DIRECT DEP PPD ID: 9111111103 ACH credit $604.60 $1,202.22
Oct 11,2018 DEPOSIT ID NUMBER 706989 Deposit $541.56 $597.62
Oct 10, 2018 AURORA UNIVERSITY 800-742-5281 IL 10/09 (...4719) Card -$450.00 $56.06
Oct 9, 2018 ATM CASH DEPOSIT 10/08 1590 LEE ST DES PLAINES IL ATM transaction $400.00 $506.06
Oct 2, 2018 Convenience Fee WEB PAY Vaughn WEB ID: 2364303385 ACH debit -$1.50 $106.06
Vaughn WEB PAY Vaughn WEB ID: 1364303385 ACH debit -$1,118.10 $107.56
AURORA UNIVERSITY 800-742-5281 IL 10/01 (...4719) Card -$550.00 $1,225.66
Oct 1, 2018 SPEEDWAY 04250 DES DES PLAINES IL 09/29 (...4719) Card -$35.08 $1,775.66
ATM CASH DEPOSIT 10/01 1590 LEE ST DES PLAINES IL ATM transaction $380.00 $1,810.74
Sep 28, 2018 VILLAGE OF ROSEM DIRECT DEP PPD ID: 9111111103 ACH credit $561.62 $1,430.74
ATM CHECK DEPOSIT 09/28 1590 LEE ST DES PLAINES IL ATM deposit $785.45 $869.12
Sep 24,2018 SPEEDWAY 04250 DES DES PLAINES IL 09/21 (...4719) Card -$14.93 $83.67

[before and after images]

DATE DESCRIPTION AMOUNT
06/27 Card Purchase 06/26 Culinart 119 At Con Long Island C NY Card 0018 $3.43
06/27 Card Purchase 06/27 Tst* Slice - Long |s Long Island C NY Card 0018 7.50
06/28 Card Purchase 06/27 Paypal *Netflix.Com 402-935-7733 CA Card 0018 13.99
06/28 Card Purchase 06/27 Culinart 119 At Con Long Island C NY Card 0018 6.26
06/29 Card Purchase 06/27 Butcher Bar Astoria NY Card 0018 10.00
| 06/29 Card Purchase 06/28 Culinart 119 At Con Long Island C NY Card 0018 5.93
| 06/29 Card Purchase 06/28 Boston Market 1669 Woodside NY Card 0018 11.90
| 06/29 Card Purchase 06/29 Caridad& Louis Rest Bronx NY Card 0018 31.79
| 06/29 Card Purchase With Pin 06/29 Superior Deli Long Island C NY Card 0018 8.00
07/02 Card Purchase 06/29 Culinart 119 At Con Long Island C NY Card 0018 2.88
07/02 Card Purchase 06/29 Bel Aire Diner Astoria NY Card 0018 18.53
07/02 Card Purchase 06/30 Gulf Oil 92039469 Bronx NY Card 0018 30.00
07/02 Card Purchase 06/30 Front Street Pizza Brooklyn NY Card 0018 6.26
07/02 Card Purchase 06/30 Gulf Oil 92039469 Bronx NY Card 0018 63.22
07/02 Card Purchase With Pin 07/01 Four Brothers Discount Bronx NY Card 0018 19.54
07/02 Card Purchase 07/01 Medonald's F2658 Bronx NY Card 0018 44.98
07/03 Recurring Card Purchase 07/03 Spotify USA 646-8375380 NY Card 0018 9.99
07/05 Card Purchase 07/02 Eastside Mkt Corp New York NC Card 0018 9.26
07/05 Card Purchase 07/03 Salvo's Pizza Bar New York NY Card 0018 15.00
07/05 Card Purchase 07/03 Eastside Mkt Corp New York NC Card 0018 8.79
07/05 Card Purchase 07/04 3340 Dominos Pizza 734-930-3030 NY Card 0018 37.58
07/09 Card Purchase 07/05 Eastside Mkt Corp New York NC Card 0018 9.78
07/09 Card Purchase 07/06 Salvo's Pizza Bar New York NY Card 0018 8.68
07/09 Card Purchase 07/07 Medonald's F2658 Bronx NY Card 0018 18.05
| 07/09 Card Purchase 07/08 lhop 4634 Bronx NY Card 0018 34.70
07/09 Recurring Card Purchase 07/06 Ibi*Shoedazzle 888-5081888 CA Card 0018 39.95
07/10 Card Purchase 07/09 Culinart 119 At Con Long Island C NY Card 0018 2.88
07/10 Card Purchase 07/09 Paypal *Bioceutical 402-935-7733 CA Card 0018 65.75
107/10 Card Purchase 07/09 Mamas Fmnanadas Astoria NY Card 0018 1178
07/10 Card Purchase With Pin 07/10 Community Green Market Bronx NY Card 0018 55.98

Code

import cv2
import pytesseract
import numpy as np

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, Otsu's threshold
image = cv2.imread('7.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Repair horizontal table lines 
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,1))
thresh = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=1)

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (55,2))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(image, [c], -1, (255,255,255), 9)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2,55))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(image, [c], -1, (255,255,255), 9)

# Perform OCR
data = pytesseract.image_to_string(image, lang='eng',config='--psm 6')
print(data)

cv2.imshow('image', image)
cv2.imwrite('image7.png', image)
cv2.waitKey()

Note: The grid removal step was adapted from "Removing Horizontal Lines in image (OpenCV, Python, Matplotlib)". The right kernel size depends on the image. For instance, to detect only longer lines, we could use a (50,1) kernel instead. To detect only thicker lines, we could increase the second parameter, e.g. to (50,2).

Upvotes: 6

dvhamme

Reputation: 1450

The reduce operation will only work for this purpose if your document is perfectly aligned, with the text running horizontally. If you cannot guarantee this (as in your example), you have to do one of the following:

A) Estimate the rotation (e.g. by measuring it with a 2D DFT) and compensate for it.

B) Pre-rotate the image over a range of angles (e.g. -3 degrees to 3 degrees in half-degree increments) and pick the best result using a quality metric, e.g. the maximum separation between the non-zero bins of "hist".

Upvotes: 1
