Mark k

Reputation: 153

pdfplumber only extracting the first row of data of every page in 300+ page document

I'm trying to extract all the rows of data from the following pdf:

https://www.seguridadaerea.gob.es/sites/default/files/aeronaves_inscritas.pdf

I've written a Python script that uses the pdfplumber library to extract the data. I've cropped the page so that it's only focusing on the rows of data and not all the header/footer info. The script is shown below:

import pandas as pd
import pdfplumber


pdf_file = f"C:/Users/xxx/Downloads/aeronaves_inscritas.pdf"

box = (0, 132, 840, 555)

all_tables = []

with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        cropped_page = page.crop(bbox=box)
        
        table = cropped_page.extract_table(table_settings={
            "vertical_strategy": "lines",
            "horizontal_strategy": "lines",
        })
        
        filtered_table = [row for row in table if ''.join([str(i) for i in row]) != '']
        
        all_tables.extend(filtered_table)

df = pd.DataFrame(all_tables)

print(df.head())

df.to_csv('extracted_tables.csv', index=False)

The issue I'm having is that the script only extracts the first row of every page, and I just can't work out why. Can anyone help? I've tried other PDF extraction libraries in Python, but they don't extract the rows of data as cleanly as pdfplumber does.

Upvotes: 3

Views: 452

Answers (2)

Vitalizzare

Reputation: 7250

Notes on adjusting table extraction settings

When adjusting table extraction settings, like in the answer provided by Richard Kraus, we may need to tweak several parameters simultaneously. For example, with a higher snapping tolerance, increasing the intersection tolerance might be necessary for cases where snapped edges go too far to be recognized as intersecting. Otherwise, we risk losing some records or columns, as seen here:

[image: missed record example]

In the case of this document, we need to either adjust the bounding box to exclude the topmost residual horizontal line or increase the intersection tolerance along the verticals, as shown here:

[image: intersection tolerance error fixed]
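To see this interplay concretely, here's a minimal sketch that compares how many rows survive a raised snapping tolerance with pdfplumber's default intersection tolerance (3) versus an increased one. The values are assumptions for this document, and the counts will differ from page to page:

import pdfplumber

pdf_file = "aeronaves_inscritas.pdf"    # path assumed; point it at your local copy

with pdfplumber.open(pdf_file) as pdf:
    page = pdf.pages[0].crop((0, 132, 840, 555))
    for iy in (3, 6):    # 3 is pdfplumber's default intersection tolerance
        settings = {"snap_y_tolerance": 10, "intersection_y_tolerance": iy}
        rows_per_table = [len(t.rows) for t in page.find_tables(settings)]
        print(f"intersection_y_tolerance={iy}: rows per table {rows_per_table}")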

Either way, I'd also suggest checking the detected tables before extraction and concatenation, along these lines:

tables_on_page = [page.crop(bbox).find_tables(settings) for page in pdf.pages]

assert all(len(t)==1 for t in tables_on_page), 'Too many tables found'
assert all(len(t[0].columns) == 10 for t in tables_on_page), 'Incorrect number of columns'
assert all(len(t[0].rows) > 1 for t in tables_on_page), 'Not enough records'

When the settings seem fine, we combine the data (page numbers and internal record indexing should also be preserved to double-check later):

from tqdm import tqdm       # for monitoring progress

captions = ['Matrícula', 'Fecha matric.', 'Fabricante', 'Modelo', 'Nº serie',
            'Año cons.', 'Marca Motor', 'Modelo Motor', 'Nº mot.', 'Clase']

settings = {
    "snap_y_tolerance": 10,
    "intersection_y_tolerance": 6,    # ensure that intersections are possible
}

bbox = (0, 134, 840, 555)   # adjust the top to exclude the residual upper horizontal line

tables = (                   # treated as a generator to minimize memory consumption
    pd.DataFrame(
        page.crop(bbox).extract_table(settings), 
        columns=captions)
    .assign(page=page.page_number)    # save the page number
    .reset_index(names='index')       # and internal record indexing for verification
    for page in tqdm(pdf.pages, 'Pages')
)

answer = pd.concat(tables, ignore_index=True).set_index(['page', 'index'])

Notes on using the default extraction settings

When using the default settings, we get several tables per page for this document, usually one for each record, though rows that sit close enough together may be grouped differently. Computing may require more resources in this case, but it is still possible to complete the task. We simply supply each table's records one by one in a chain when combining them into a data frame:

from itertools import chain

data = (
    pd.DataFrame(
        chain(*page.crop(bbox).extract_tables()),
        columns=captions)
    .assign(page=page.page_number)
    .reset_index(names='index')
    for page in tqdm(pdf.pages, desc='Pages')
)
df = pd.concat(data, ignore_index=True).set_index(['page', 'index'])

assert df.equals(answer)    # check that the result matches the previous one

In both approaches, the result for the provided document is the same.
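Finally, assuming the answer frame built above, the combined result can be exported the same way as in the question:

answer.to_csv('extracted_tables.csv')    # the (page, index) MultiIndex is written as the first two columns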

Upvotes: 1

Richard Kraus

Reputation: 126

The table should be no problem for pdfplumber; it's just a matter of tuning the extraction settings.

Always use the debug_tablefinder, and at first do it in a Jupyter notebook; that way it's easy to play around with the settings and see the effect instantly.

Do something like:

# Increase snap_y_tolerance until you see that the row lines are merged together.
# The setting joins parallel lines that are close together, which should solve your issue.
settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_y_tolerance": 10,
}
im = cropped_page.to_image(resolution=150)
im.reset().debug_tablefinder(settings)
table = pd.DataFrame(cropped_page.extract_table(settings))
display(im)     # show the debug image
display(table)

After that, you can merge all the dataframes into one :)
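For completeness, a minimal sketch of that final step, reusing the question's pdf_file and crop box together with the settings above:

import pandas as pd
import pdfplumber

frames = []
with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        cropped_page = page.crop((0, 132, 840, 555))
        frames.append(pd.DataFrame(cropped_page.extract_table(settings)))

df = pd.concat(frames, ignore_index=True)    # one dataframe for the whole document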

Upvotes: 1
