Reputation: 153

pdfplumber not picking up column & issue with multiline data

So i'm struggling with two things with a pdf extraction script i've written.

The first thing being that the script isn't picking up the last column 'Serial Number' I've boxed the area I'm interested in along with the explicit vertical strategy lines that i need

As you can see from the screenshot the vertical lines neatly divide the columns and this is true on all the pages in the pdf. The boxed area also captures everything i want on all the pages.

My script is here:

import pdfplumber

pdf_file = r"C:\Users\xxxx\Downloads\Active Aircraft Register.pdf"
box = (0, 35, 980, 565)
explicit_vertical_lines = [18, 57, 127, 325, 518, 713, 830, 920, 984]

all_tables = []


with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        cropped_page = page.crop(bbox=box)
        table = cropped_page.extract_table(table_settings={
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": explicit_vertical_lines,
            "horizontal_strategy": "text",
        })
        if table:
            all_tables.extend(table)

# Check if we have any tables extracted
if not all_tables:
    print("No tables found in the PDF.")
else:
    for row in all_tables[:10]:
        print(row)

The second issue is once the script completes the extract if any row has a multi-line cell in it (e.g. the Address or Type Variant) it then places each line separately with a space, is there anyway to make it all go on one line?

The PDF file that I'm using can be downloaded here to test the script:

https://web.archive.org/web/20240422115729/https://www.caacayman.com/wp-content/uploads/pdf/Active%20Aircraft%20Register.pdf

Upvotes: 3

Answers (1)

Vitalizzare

Reputation: 7250

pdfplumber 0.11.4

When extracting with the horizontal strategy 'text', it's important to remember that pdfplumber creates new horizontal edges for each detected line. These edges start at the horizontal position x0 of the leftmost character and end at x1 of the rightmost character within the bounding box. If vertical edges are defined too far from the endpoints of these newly created horizontal edges, they may not be detected as intersecting in the following step. This is why you might lose the rightmost column - the last vertical position in your explicit_vertical_lines does not intersect with the horizontal edges built from the word lines.

To resolve this issue, we can increase the intersection_x_tolerance parameter to a value that compensates for the gap between the horizontal edges and the vertical line:

"intersection_x_tolerance": explicit_vertical_lines[-1] - max(c['x1'] for c in page.chars)

Or we can redefine the position of the rightmost vertical line so that it aligns with the text on the right, like this:

explicit_vertical_lines[-1] = max(c['x1'] for c in page.chars)

For further examples, you can refer to this case, which provides more details, or this one with a lost row at the top, which stems from a similar issue.

However, I'd like to suggest an alternative approach that resolves both problems simultaneously. As observed, the data in each cell is vertically aligned to the top and each cell in the first column has only one line. Therefore, we can identify the top line of each cell using the words in the first column. This approach involves cropping the page to isolate the column, extracting the lines/words, and using their 'top' values as points for explicit_horizontal_lines. Here’s an example of the code we can use.

Let's load the data:

import pdfplumber
import urllib.request
from io import BytesIO

data_url = r'https://web.archive.org/web/20240422115729if_'\
           r'/https://www.caacayman.com/wp-content/uploads/pdf/'\
           r'Active%20Aircraft%20Register.pdf'
with urllib.request.urlopen(data_url) as file:
    pdf = pdfplumber.open(BytesIO(file.read()))

Remarks about the document:

Pages have shifted bounding boxes.
All these boxes are the same on all pages.
There's no lines on any page.
There's one rectangle comprising the headers on the first page.
There's no rectangle on other pages.

assert pdf.pages[0].bbox == (-11.9906, -11.962200000000053, 996.009, 600.0375999999999)
assert all(p.bbox == pdf.pages[0].bbox for p in pdf.pages)
assert all(len(p.lines) == 0 for p in pdf.pages)
assert len(pdf.pages[0].rects) == 1
assert all(len(p.rects) == 0 for p in pdf.pages[1:])

Let's use the headers and rectangle around them to define vertical lines. Note, that the words "Registered" and "Type" appear twice in the headers.Both times, we choose the one on the right to define the vertical line. All other header words are unique and safe to use as a column marker.

# Collect vertical lines positions
rect = pdf.pages[0].rects[0]
header_bbox = rect['x0'], rect['top'], rect['x1'], rect['bottom']
words = pdf.pages[0].crop(header_bbox).extract_words()
word_pos = {}
for w in words:
    word_pos[w['text']] = max(word_pos.get(w['text'], 0), w['x0'])
# Shift slightly to the left to avoid letter leakage between columns
vertical_lines = (word_pos[w] - 1 for w in     
    'From Registered Address Manufacturer Series Type Serial'.split())
vertical_lines = rect['x0'], *vertical_lines, rect['x1']

Deduce positions of horizontal lines by words in the first column and define settings with explicit horizontal and vertical lines:

# Define the top and bottom edges that enclose the data on all pages
top = rect['top']
bottom = pdf.pages[0].extract_text_lines()[-1]['top'] - 1
first_column_bbox = vertical_lines[0], top, vertical_lines[1], bottom

def get_horizontal_lines(page):
    '''Return positions of horizontal lines based on the words in the first column of the page'''
    words = page.crop(first_column_bbox).extract_words()
    return top, *sorted(w['top'] - 1 for w in words)[1:], bottom

def get_settings(page): 
    '''Return settings with explicit horizontal and vertical lines'''
    return {
        'horizontal_strategy': 'explicit',
        'vertical_strategy': 'explicit',
        'explicit_horizontal_lines': get_horizontal_lines(page),
        'explicit_vertical_lines': vertical_lines}

# See table layout on the first page
pdf.pages[0].to_image(100).debug_tablefinder(get_settings(pdf.pages[0])).show()

^{Data source: Active Aircraft Register.pdf}

P.S. To collect data from all pages, we could try literally feeding them in a chain:

import pandas as pd
from itertools import chain

headers, *data = chain(*(p.extract_table(get_settings(p)) for p in pdf.pages))
df = pd.DataFrame(data, columns=headers)

To make it easier to review the result, we might want to consider supplying page numbers as well, for example:

def get_dataframe(page, headers=None):
    replace_newlines = lambda x: x.replace('\N{new line}', '\N{space}')
    data = [map(replace_newlines, record)
            for record in page.extract_table(get_settings(page))]
    if headers is None:
        headers, *data = data        
    df = (pd.DataFrame(data, columns=headers)
          .rename_axis('index')
          .assign(page=page.page_number)
          .set_index('page', append=True)
          .reorder_levels(['page', 'index']))
    return df 

data = [get_dataframe(pdf.pages[0])]
headers = data[0].columns
data.extend(get_dataframe(page, headers) for page in pdf.pages[1:])
df = pd.concat(data, ignore_index=False)

page_number = 1
print(df.loc[page_number].to_string())

         Mark Registered From                               Registered Owner                                                                                           Address                      Manufacturer      Series Type     Type Variant Serial Number
index                                                                                                                                                                                                                                                           
0      VP-CAA        4/6/2020                 Augusta Aviation (Cayman) Inc.                                 87 Mary Street, George Town, Grand Cayman KY1-9001 Cayman Islands               Boeing Company, The    DC-9-8 Series  DC-9-87 (MD-87)         49767
1      VP-CAB       11/1/2023                 Asia Pacific Leasing Co S.A.S.                                                   14 boulevard de la Madeleine 75008 Paris France               Boeing Company, The     737-8 Series            737-8         39390
2      VP-CAC      11/26/2009                             Boston Air Limited  Maples Corporate Services Limited P.O. Box 309 Ugland House Grand Cayman KY1-1104 Cayman Islands                            Airbus  A330-200 Series         A330-243          1053
3      VP-CAD       1/29/2021                                      BB 11 LTD                                                         8703 Telephone Road Houston, TX 77061 USA               Boeing Company, The   737-200 Series          737-2W8         22628
4      VP-CAE      12/31/2019                            KC-Aviation Limited                              OMC Chambers Wickhams Cay I Road Town Tortola British Virgin Islands  Gulfstream Aerospace Corporation       G V Series            GV-SP          5051
5      VP-CAF       1/28/2016                     AT21 International Limited  Maples Corporate Services Limited P.O. Box 309 Ugland House Grand Cayman KY1-1104 Cayman Islands                   Bombardier Inc.      BD-700-1A10      BD-700-1A10          9686
6      VP-CAH        1/8/2024                         Phenix Jet Cayman SEZC                        BritCay House 236 Washington Road George Town, Grand Cayman Cayman Islands                   Bombardier Inc.      BD-700-2A12                          70167
7      VP-CAI       1/11/2024  Marlmount Capital Designated Activity Company                          5th Floor, The Exchange, George's Dock, IFSC, Dublin 1, D01 W3P9 Ireland                            Airbus  A330-200 Series         A330-243          1319

Upvotes: 1

pdfplumber not picking up column &amp; issue with multiline data

Answers (1)

Related Questions

pdfplumber not picking up column & issue with multiline data