go sgenq

Reputation: 323

Extract a borderless table with pdfplumber

I am trying to extract borderless tables from a PDF document. I have tried a few combinations of the table_settings parameters, but pdfplumber does not recognize the borderless tables correctly.

The PDF file can be downloaded from the link.

Here is my code:

import pdfplumber

pdf_file = "pdffile"
with pdfplumber.open(pdf_file) as pdf:
    for i in range(0, len(pdf.pages)):
        try:
            # Only the page at index 7 holds the table of interest
            if i == 7:
                bold_title_text = pdf.pages[i]
                ff = bold_title_text.extract_table(table_settings={
                    "vertical_strategy": "text",
                    "horizontal_strategy": "lines",
                    "keep_blank_chars": True,  # boolean, not the string "True"
                    "snap_tolerance": 4,
                })
                # display() is the Jupyter helper; ff[1] is the second extracted row
                display(ff[1])
        except IndexError:
            print("")
            break


Output: ['Element','nt Attribute Size Input Type Requirement']

Expected output: ['Element', 'Attribute', 'Size', 'Input Type', 'Requirement']

Upvotes: 2

Views: 1722

Answers (1)

Samkit Jain

Reputation: 1748

For tables that have no vertical line separators, you can either:

  1. Crop the page to the table region first, then either
    1. Use the "text" strategy as in your question. Without the crop it does not work well, because the non-table text interferes with the table extraction logic.
    2. Use the "explicit" strategy for the vertical lines and specify their X-coordinates.
  2. Use the "explicit" strategy for the vertical lines and specify their X-coordinates without cropping. In that case, add post-processing logic to filter out the non-table data.
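For the crop-first route, a minimal sketch could look like the following. The helper name extract_cropped_table and the bounding-box numbers in the usage comment are my own placeholders, not values measured from your PDF; you would need to read the real coordinates off the page (for example from page.extract_words()).

```python
# Settings taken from the question: columns inferred from text alignment,
# rows taken from the ruled horizontal lines.
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "snap_tolerance": 4,
}


def extract_cropped_table(page, bbox, settings=TEXT_SETTINGS):
    """Crop a pdfplumber page to bbox = (x0, top, x1, bottom), then extract one table.

    Hypothetical helper: cropping first keeps the surrounding non-table text
    away from the "text" strategy's column detection.
    """
    cropped = page.crop(bbox)
    return cropped.extract_table(table_settings=settings)


# Usage sketch (the bbox values are placeholders):
#   import pdfplumber
#   with pdfplumber.open("file.pdf") as pdf:
#       rows = extract_cropped_table(pdf.pages[6], (80, 100, 520, 400))
#       for row in rows or []:
#           print(row)
```

Taking the page as an argument keeps the open/close handling with the caller, and extract_table can return None when nothing is found, hence the `rows or []` in the usage comment.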

Here is an example for the explicit lines that works with the table you've shared:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[6]
# Column boundaries are given by hand as X-coordinates;
# row boundaries still come from the ruled horizontal lines.
tables = page.extract_tables(table_settings={
    "vertical_strategy": "explicit",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [90, 200, 250, 320, 440, 510],
})
for table in tables:
    print()
    for row in table:
        print(row)
pdf.close()

With this, your table output becomes:

['Element', 'Attribute', 'Size', 'Input Type', 'Requirement']
['TransmittingCountry', '', '2-character', 'iso:CountryCode_Type', 'Validation']
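If you skip the crop, the post-processing can be as simple as looking for the known header row and keeping what follows it. This is only one possible heuristic, and rows_under_header is a name I made up for illustration; it assumes the header cells come out exactly as in the output above.

```python
def rows_under_header(tables, header):
    """Return the data rows below the first row equal to `header`, as dicts.

    `tables` is a list of tables as returned by page.extract_tables();
    cells may be None, so they are normalised to stripped strings first.
    Rows that are completely empty are dropped.
    """
    def clean(row):
        return [(cell or "").strip() for cell in row]

    for table in tables:
        for i, row in enumerate(table):
            if clean(row) == header:
                return [
                    dict(zip(header, clean(body_row)))
                    for body_row in table[i + 1:]
                    if any(clean(body_row))
                ]
    return []  # header not found on this page


# Usage sketch with the header from the question:
#   header = ["Element", "Attribute", "Size", "Input Type", "Requirement"]
#   records = rows_under_header(tables, header)
```

Matching on the header text means any other ruled blocks on the page are ignored without having to know where the table sits.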

Upvotes: 2
