Reputation: 323
I am trying to extract borderless tables from a PDF document. I have tried a few combinations of the table_settings parameters, but pdfplumber does not recognize the borderless tables correctly.
The PDF file can be downloaded from the link.
Here is my code:
import pdfplumber
pdf_file="pdffile"
with pdfplumber.open(pdf_file) as pdf:
    for i in range(0,len(pdf.pages)):
        try:
            if i==7:
                bold_title_text=pdf.pages[i]
                ff=bold_title_text.extract_table(table_settings=
                {"vertical_strategy": "text",
                "horizontal_strategy": "lines",
                "keep_blank_chars": "True",
                "snap_tolerance": 4,
                })
                display(ff[1])
        except IndexError:
            print("")
            break
Output: ['Element','nt Attribute Size Input Type Requirement']
Expected output: ['Element', 'Attribute', 'Size', 'Input Type', 'Requirement']
Upvotes: 2
Views: 1722
Reputation: 1748
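For tables that have no vertical line separators, you can set "vertical_strategy" to "explicit" and supply the column boundaries yourself through "explicit_vertical_lines".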
Here is an example using explicit lines that works with the table you've shared:
import pdfplumber
pdf = pdfplumber.open("file.pdf")
page = pdf.pages[6]
# x-coordinates of the column boundaries, picked by hand for this page
tables = page.extract_tables(table_settings={
    "vertical_strategy": "explicit",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [90, 200, 250, 320, 440, 510],
})
for table in tables:
    print()
    for row in table:
        print(row)
With this, your table output becomes:
['Element', 'Attribute', 'Size', 'Input Type', 'Requirement']
['TransmittingCountry', '', '2-character', 'iso:CountryCode_Type', 'Validation']
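If you would rather not pick the x-coordinates by eye, one possible approach is to derive them from the header row's word positions. The sketch below is only an illustration: `column_boundaries` and `min_gap` are hypothetical names, and the sample coordinates are made up rather than measured from your PDF. It assumes the input is a list of dicts with "x0" and "x1" keys, which is the shape that pdfplumber's page.extract_words() returns, already filtered down to the words of the header row.

```python
def column_boundaries(words, min_gap=10.0):
    """Guess vertical line x-positions from the words of one header row.

    A new column starts wherever the horizontal gap between two
    neighbouring words is at least `min_gap`; smaller gaps are treated
    as spaces inside one header cell (e.g. "Input Type").
    """
    words = sorted(words, key=lambda w: w["x0"])
    if not words:
        return []
    boundaries = [words[0]["x0"]]      # left edge of the table
    right_edge = words[0]["x1"]
    for w in words[1:]:
        if w["x0"] - right_edge >= min_gap:
            # put the line halfway between the two columns
            boundaries.append((right_edge + w["x0"]) / 2)
        right_edge = max(right_edge, w["x1"])
    boundaries.append(right_edge)      # right edge of the table
    return boundaries


# Made-up header coordinates, for illustration only
header = [
    {"text": "Element", "x0": 92, "x1": 130},
    {"text": "Attribute", "x0": 202, "x1": 245},
    {"text": "Size", "x0": 262, "x1": 280},
    {"text": "Input", "x0": 322, "x1": 345},
    {"text": "Type", "x0": 348, "x1": 370},
    {"text": "Requirement", "x0": 442, "x1": 500},
]
print(column_boundaries(header))
# [92, 166.0, 253.5, 301.0, 406.0, 500]
```

The resulting list could then be passed as "explicit_vertical_lines". Cell text that is wider than its header may still be cut, so the value of min_gap and the computed positions will probably need adjusting per document.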
Upvotes: 2