Reputation: 19
I am using pdfplumber to extract tables from a PDF. The table has no visible vertical lines separating its content, so the data is extracted as 3 rows and one huge column.
I would like the table to be extracted as 13 rows.
import pdfplumber
import pandas as pd
import numpy as np
with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    print(tables)
From the documentation, I could not work out whether there is a specific table setting I could apply. I tried a few, but they did not help.
Upvotes: 0
Views: 13487
Reputation: 473
Pass the settings below as table_settings when calling extract_table() or extract_tables() (they may need to be adjusted for your input file):
import pdfplumber
import pandas as pd
import numpy as np
with pdfplumber.open(r'document.pdf') as pdf:
    page = pdf.pages[0]
    table = page.extract_table(table_settings={"vertical_strategy": "lines",
                                               "horizontal_strategy": "text",
                                               "snap_tolerance": 4,})
df = pd.DataFrame(table, columns=table[0]).T
Moreover, please read the extracting-tables section of the pdfplumber documentation, as there are many options you can use depending on your input file:
https://github.com/jsvine/pdfplumber#extracting-tables
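Since the right combination depends on the file, one way to pick settings is to try a few and compare the shape of what comes back. The sketch below is only an illustration: the names CANDIDATE_SETTINGS, table_shape and compare_settings are made up for this example, and the three candidates are just plausible starting points, not recommended values.

```python
# Hypothetical candidates; which one works depends entirely on the PDF.
CANDIDATE_SETTINGS = {
    "lines_x_text_y": {"vertical_strategy": "lines", "horizontal_strategy": "text"},
    "text_both": {"vertical_strategy": "text", "horizontal_strategy": "text"},
    "lines_both": {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
}


def table_shape(table):
    """Return (rows, columns) for an extracted table, or (0, 0) if none was found."""
    if not table:
        # extract_table() returns None when it finds no table on the page.
        return (0, 0)
    return (len(table), max(len(row) for row in table))


def compare_settings(path, page_number=0):
    """Extract the largest table on one page with each candidate and report its shape."""
    import pdfplumber  # imported here so the helpers above work without it

    shapes = {}
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        for name, settings in CANDIDATE_SETTINGS.items():
            table = page.extract_table(table_settings=settings)
            shapes[name] = table_shape(table)
    return shapes
```

Calling compare_settings('document.pdf') returns a dict of shapes keyed by candidate name. If the table is expected to have 13 rows, the candidate whose row count is closest is a reasonable one to refine further, for example by adjusting snap_tolerance.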
Upvotes: 2
Reputation: 84
You can use pandas.DataFrame to customize your table instead of directly printing the table.
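table = tables[0]  # extract_tables() returns a list of tables; take the first
df = pd.DataFrame(table[1:], columns=table[0])  # first row becomes the header
for column in df.columns.tolist():
    df[column] = df[column].str.replace(" ", "")  # remove spaces inside cells
print(df)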
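One thing to watch for with this kind of cleanup is that pdfplumber returns None for empty cells, and text-based strategies can produce rows that are entirely empty. As a rough sketch, the rows can be tidied in plain Python before building the DataFrame. The helper name clean_table is made up for this example, and it collapses runs of whitespace rather than removing every space, which is a different choice from the snippet above.

```python
def clean_table(table):
    """Split an extracted table into (header, rows) with tidied cell text."""

    def clean(cell):
        # Empty cells come back as None; treat them as empty strings.
        if cell is None:
            return ""
        # Collapse newlines and repeated spaces into single spaces.
        return " ".join(cell.split())

    cleaned = [[clean(cell) for cell in row] for row in table]
    # Drop rows in which every cell is empty.
    cleaned = [row for row in cleaned if any(row)]
    if not cleaned:
        return [], []
    return cleaned[0], cleaned[1:]


sample = [["Name ", "Qty"], [None, None], ["Wid\nget", None]]
header, rows = clean_table(sample)
print(header)
print(rows)
```

The result can then be passed to pandas as pd.DataFrame(rows, columns=header).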
Upvotes: 0