Reputation: 153
So i'm struggling with two things with a pdf extraction script i've written.
The first thing being that the script isn't picking up the last column 'Serial Number'
I've boxed the area I'm interested in along with the explicit vertical strategy lines that i need
As you can see from the screenshot the vertical lines neatly divide the columns and this is true on all the pages in the pdf. The boxed area also captures everything i want on all the pages.
My script is here:
import pdfplumber
pdf_file = r"C:\Users\xxxx\Downloads\Active Aircraft Register.pdf"
box = (0, 35, 980, 565)
explicit_vertical_lines = [18, 57, 127, 325, 518, 713, 830, 920, 984]
all_tables = []
with pdfplumber.open(pdf_file) as pdf:
for page in pdf.pages:
cropped_page = page.crop(bbox=box)
table = cropped_page.extract_table(table_settings={
"vertical_strategy": "explicit",
"explicit_vertical_lines": explicit_vertical_lines,
"horizontal_strategy": "text",
})
if table:
all_tables.extend(table)
# Check if we have any tables extracted
if not all_tables:
print("No tables found in the PDF.")
else:
for row in all_tables[:10]:
print(row)
The second issue is once the script completes the extract if any row has a multi-line cell in it (e.g. the Address or Type Variant) it then places each line separately with a space, is there anyway to make it all go on one line?
The PDF file that I'm using can be downloaded here to test the script:
https://web.archive.org/web/20240422115729/https://www.caacayman.com/wp-content/uploads/pdf/Active%20Aircraft%20Register.pdf
Upvotes: 3
Views: 417
Reputation: 7250
pdfplumber 0.11.4
When extracting with the horizontal strategy 'text'
, it's important to remember that pdfplumber
creates new horizontal edges for each detected line. These edges start at the horizontal position x0
of the leftmost character and end at x1
of the rightmost character within the bounding box. If vertical edges are defined too far from the endpoints of these newly created horizontal edges, they may not be detected as intersecting in the following step. This is why you might lose the rightmost column - the last vertical position in your explicit_vertical_lines
does not intersect with the horizontal edges built from the word lines.
To resolve this issue, we can increase the intersection_x_tolerance parameter to a value that compensates for the gap between the horizontal edges and the vertical line:
"intersection_x_tolerance": explicit_vertical_lines[-1] - max(c['x1'] for c in page.chars)
Or we can redefine the position of the rightmost vertical line so that it aligns with the text on the right, like this:
explicit_vertical_lines[-1] = max(c['x1'] for c in page.chars)
For further examples, you can refer to this case, which provides more details, or this one with a lost row at the top, which stems from a similar issue.
However, I'd like to suggest an alternative approach that resolves both problems simultaneously. As observed, the data in each cell is vertically aligned to the top and each cell in the first column has only one line. Therefore, we can identify the top line of each cell using the words in the first column. This approach involves cropping the page to isolate the column, extracting the lines/words, and using their 'top'
values as points for explicit_horizontal_lines
. Here’s an example of the code we can use.
Let's load the data:
import pdfplumber
import urllib.request
from io import BytesIO
data_url = r'https://web.archive.org/web/20240422115729if_'\
r'/https://www.caacayman.com/wp-content/uploads/pdf/'\
r'Active%20Aircraft%20Register.pdf'
with urllib.request.urlopen(data_url) as file:
pdf = pdfplumber.open(BytesIO(file.read()))
Remarks about the document:
assert pdf.pages[0].bbox == (-11.9906, -11.962200000000053, 996.009, 600.0375999999999)
assert all(p.bbox == pdf.pages[0].bbox for p in pdf.pages)
assert all(len(p.lines) == 0 for p in pdf.pages)
assert len(pdf.pages[0].rects) == 1
assert all(len(p.rects) == 0 for p in pdf.pages[1:])
Let's use the headers and rectangle around them to define vertical lines. Note, that the words "Registered" and "Type" appear twice in the headers.Both times, we choose the one on the right to define the vertical line. All other header words are unique and safe to use as a column marker.
# Collect vertical lines positions
rect = pdf.pages[0].rects[0]
header_bbox = rect['x0'], rect['top'], rect['x1'], rect['bottom']
words = pdf.pages[0].crop(header_bbox).extract_words()
word_pos = {}
for w in words:
word_pos[w['text']] = max(word_pos.get(w['text'], 0), w['x0'])
# Shift slightly to the left to avoid letter leakage between columns
vertical_lines = (word_pos[w] - 1 for w in
'From Registered Address Manufacturer Series Type Serial'.split())
vertical_lines = rect['x0'], *vertical_lines, rect['x1']
Deduce positions of horizontal lines by words in the first column and define settings with explicit horizontal and vertical lines:
# Define the top and bottom edges that enclose the data on all pages
top = rect['top']
bottom = pdf.pages[0].extract_text_lines()[-1]['top'] - 1
first_column_bbox = vertical_lines[0], top, vertical_lines[1], bottom
def get_horizontal_lines(page):
'''Return positions of horizontal lines based on the words in the first column of the page'''
words = page.crop(first_column_bbox).extract_words()
return top, *sorted(w['top'] - 1 for w in words)[1:], bottom
def get_settings(page):
'''Return settings with explicit horizontal and vertical lines'''
return {
'horizontal_strategy': 'explicit',
'vertical_strategy': 'explicit',
'explicit_horizontal_lines': get_horizontal_lines(page),
'explicit_vertical_lines': vertical_lines}
# See table layout on the first page
pdf.pages[0].to_image(100).debug_tablefinder(get_settings(pdf.pages[0])).show()
Data source: Active Aircraft Register.pdf
P.S. To collect data from all pages, we could try literally feeding them in a chain:
import pandas as pd
from itertools import chain
headers, *data = chain(*(p.extract_table(get_settings(p)) for p in pdf.pages))
df = pd.DataFrame(data, columns=headers)
To make it easier to review the result, we might want to consider supplying page numbers as well, for example:
def get_dataframe(page, headers=None):
replace_newlines = lambda x: x.replace('\N{new line}', '\N{space}')
data = [map(replace_newlines, record)
for record in page.extract_table(get_settings(page))]
if headers is None:
headers, *data = data
df = (pd.DataFrame(data, columns=headers)
.rename_axis('index')
.assign(page=page.page_number)
.set_index('page', append=True)
.reorder_levels(['page', 'index']))
return df
data = [get_dataframe(pdf.pages[0])]
headers = data[0].columns
data.extend(get_dataframe(page, headers) for page in pdf.pages[1:])
df = pd.concat(data, ignore_index=False)
page_number = 1
print(df.loc[page_number].to_string())
Mark Registered From Registered Owner Address Manufacturer Series Type Type Variant Serial Number index 0 VP-CAA 4/6/2020 Augusta Aviation (Cayman) Inc. 87 Mary Street, George Town, Grand Cayman KY1-9001 Cayman Islands Boeing Company, The DC-9-8 Series DC-9-87 (MD-87) 49767 1 VP-CAB 11/1/2023 Asia Pacific Leasing Co S.A.S. 14 boulevard de la Madeleine 75008 Paris France Boeing Company, The 737-8 Series 737-8 39390 2 VP-CAC 11/26/2009 Boston Air Limited Maples Corporate Services Limited P.O. Box 309 Ugland House Grand Cayman KY1-1104 Cayman Islands Airbus A330-200 Series A330-243 1053 3 VP-CAD 1/29/2021 BB 11 LTD 8703 Telephone Road Houston, TX 77061 USA Boeing Company, The 737-200 Series 737-2W8 22628 4 VP-CAE 12/31/2019 KC-Aviation Limited OMC Chambers Wickhams Cay I Road Town Tortola British Virgin Islands Gulfstream Aerospace Corporation G V Series GV-SP 5051 5 VP-CAF 1/28/2016 AT21 International Limited Maples Corporate Services Limited P.O. Box 309 Ugland House Grand Cayman KY1-1104 Cayman Islands Bombardier Inc. BD-700-1A10 BD-700-1A10 9686 6 VP-CAH 1/8/2024 Phenix Jet Cayman SEZC BritCay House 236 Washington Road George Town, Grand Cayman Cayman Islands Bombardier Inc. BD-700-2A12 70167 7 VP-CAI 1/11/2024 Marlmount Capital Designated Activity Company 5th Floor, The Exchange, George's Dock, IFSC, Dublin 1, D01 W3P9 Ireland Airbus A330-200 Series A330-243 1319
Upvotes: 1