Extract Data from PDF with Incorrect Structural OCR

Question

I have a regular inflow of invoice pdfs. I extract the data from these pdfs for various manipulations and storage.

Here's an example section:

The first step is to use Adobe's OCR. Then, I use tika to parse the pdf. In Python:

from tika import parser
parsedPDF = parser.from_file("the_file.pdf")

This is the expected output:

...
001 6 0 6 EA FSC450-WBKR FUTSAL, ADULT, WHT/BLK/RED BULK 


002 6 0 6 EA SS50-P SOCCER PURPLE/BLUE/WHITE BULK 


...

Rows are separated by newlines, and each row you see on the pdf is parsed as one complete line.

This is the actual output:

001 6 0 6 


002 6 0 6 


003 13 0 13 


004 3 0 3 


EA FSC450-WBKR FUTSAL, ADULT, WHT/BLK/RED BULK 


EA SS50-P SOCCER PURPLE/BLUE/WHITE BULK 


...

The OCR created a structure where the row you see on the pdf is split into two sections[*note]. The split happens between the "Shipped" and "Unit" headings.

For item 002, if I drag from the "#" heading to the "Packaging" heading, it first selects data down the first section, then jumps up to the top of the second section.

Is there a good solution to this issue? Is there a way to define structure for the OCR (e.g., so it reads a line as a single row)?

[*note]: It's actually that the text is wrapped vertically (compare to the horizontal text wrap usually seen).

PaulMcG · Accepted Answer

Rather than try to recast the data, just work with what you have. You are getting two groups of lines: the first group contains the left half of each line of data, the second group contains the right half. itertools.groupby is great for splitting up rows by some grouping criterion. In this case, you can tell that the left-half lines all start with a numeric digit, while the right-half lines don't.

Once you have these broken into two equal-sized groups, use Python's built-in zip function to stitch them back together. Then a succession of split()s can help you parse the content of each line - see the comments in the code below:

from itertools import groupby

lines = """
001 6 0 6 


002 6 0 6 


003 13 0 13 


004 3 0 3 


EA FSC450-WBKR FUTSAL, ADULT, WHT/BLK/RED BULK 


EA SS50-P SOCCER PURPLE/BLUE/WHITE BULK 


EA SS30-G SOCCER BALL GREEN/WHITE #3 BULK 


EA VQ2000-RGW COMPOSITE VB ROYAL/GOLD/WHITE BULK 


""".splitlines()

# filter out empty lines
lines = filter(None, lines)

# use groupby to walk the list, and get the lines that start with 
# numbers vs those that don't - from your description, there should be
# two groups
groups = []
for _, grouplines in groupby(lines, key=lambda ll : ll[0].isdigit()):
    groups.append(list(grouplines))

# validate the input - should be two groups of lines, each the same length
assert len(groups) == 2
assert len(groups[0]) == len(groups[1])

# use zip to walk the two groups together, and create list of consolidated data
consolidated = [left + right for left,right in zip(groups[0], groups[1])]

# now break these strings up into their various pieces, using a succession of split()s
parsed_lines = []
for cons_line in consolidated:
    left_items = cons_line.split(None, 4)
    right_items = left_items.pop(-1).rsplit(None,1)
    right_items, qty_type = right_items
    um, desc = right_items.split(None, 1)
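    # convert the map to a list before concatenating (required on Python 3,
    # where map() returns an iterator rather than a list)
    parsed_lines.append(list(map(int, left_items)) + [um, desc, qty_type])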

# dump out the parsed lines
for data in parsed_lines:
    print(data)

Gives:

[1, 6, 0, 6, 'EA', 'FSC450-WBKR FUTSAL, ADULT, WHT/BLK/RED', 'BULK']
[2, 6, 0, 6, 'EA', 'SS50-P SOCCER PURPLE/BLUE/WHITE', 'BULK']
[3, 13, 0, 13, 'EA', 'SS30-G SOCCER BALL GREEN/WHITE #3', 'BULK']
[4, 3, 0, 3, 'EA', 'VQ2000-RGW COMPOSITE VB ROYAL/GOLD/WHITE', 'BULK']
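
If a document ever yields more than one left/right block (say, one pair per page), the `len(groups) == 2` assertion above would fail. The following is only a sketch of how the same groupby-and-zip idea might be wrapped in a function that accepts alternating blocks. The name `stitch_rows` and the multi-block sample layout are hypothetical and not taken from the question; it also assumes, as above, that only left-half lines start with a digit.

```python
from itertools import groupby


def stitch_rows(text):
    """Rejoin left/right row halves and return one parsed list per row.

    Assumes the text consists of alternating runs: a run of left halves
    (lines starting with a digit) followed by an equally long run of
    right halves (lines not starting with a digit).
    """
    # drop blank lines and surrounding whitespace
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # consecutive runs of "starts with a digit" vs. "does not"
    runs = [(is_left, list(run))
            for is_left, run in groupby(lines, key=lambda ln: ln[0].isdigit())]

    # runs alternate by construction, so check the first key and the count
    if len(runs) % 2 != 0 or (runs and not runs[0][0]):
        raise ValueError("expected alternating left/right blocks")

    rows = []
    for (_, lefts), (_, rights) in zip(runs[0::2], runs[1::2]):
        if len(lefts) != len(rights):
            raise ValueError("left and right blocks differ in length")
        for left, right in zip(lefts, rights):
            numbers = [int(tok) for tok in left.split()]
            um, rest = right.split(None, 1)         # first token is the unit
            desc, qty_type = rest.rsplit(None, 1)   # last token is the packaging
            rows.append(numbers + [um, desc, qty_type])
    return rows


# hypothetical layout: two left/right blocks, built from the sample lines above
SAMPLE = """
001 6 0 6

002 6 0 6

EA FSC450-WBKR FUTSAL, ADULT, WHT/BLK/RED BULK

EA SS50-P SOCCER PURPLE/BLUE/WHITE BULK

003 13 0 13

EA SS30-G SOCCER BALL GREEN/WHITE #3 BULK
"""

for row in stitch_rows(SAMPLE):
    print(row)
```

Raising ValueError instead of using assert keeps the layout check active even when Python runs with -O, which matters if the invoices arrive unattended.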
