walter_anderson

Reputation: 19

How to extract table details into rows and columns using pdfplumber

I am using pdfplumber to extract tables from a PDF. The table I am working with has no visible vertical lines separating its content, so the extracted data ends up as 3 rows and one huge column.

sample screenshot of pdf table - grey boxes are text just hidden

I would like the table above to be extracted as 13 rows.

import pdfplumber
import pandas as pd
import numpy as np
with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

print(tables)

From the documentation I could not tell whether there are specific table settings I could apply. I tried a few, but they did not help.

Upvotes: 0

Views: 13487

Answers (2)

Ibrahim Ayoup

Reputation: 473

Add the settings below when calling extract_table() (they may need to be adjusted for your input file):

import pdfplumber
import pandas as pd

with pdfplumber.open(r'document.pdf') as pdf:
    page = pdf.pages[0]
    # Columns come from ruled lines; rows are inferred from text alignment.
    table = page.extract_table(table_settings={"vertical_strategy": "lines",
                                               "horizontal_strategy": "text",
                                               "snap_tolerance": 4})

    # The first row becomes the header; .T then transposes the frame.
    df = pd.DataFrame(table[1:], columns=table[0]).T

Moreover, please read the extracting-tables section of the pdfplumber documentation, as there are many more options you can set depending on your input file:

https://github.com/jsvine/pdfplumber#extracting-tables
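If the table has no ruled lines at all, one thing worth trying is to infer both rows and columns from the text, or to give pdfplumber the column boundaries yourself. The sketch below is only an illustration: make_table_settings and extract_rows are hypothetical helper names, the x-coordinates are made up, and the right values depend entirely on your PDF. The setting keys themselves ("vertical_strategy", "horizontal_strategy", "explicit_vertical_lines", "snap_tolerance") are the ones listed in the pdfplumber README.

```python
def make_table_settings(column_edges=None, snap_tolerance=4):
    """Build a table_settings dict for a table without ruled lines.

    column_edges: optional x-coordinates (in PDF points) of the column
    boundaries. These are assumptions you have to measure on your own page.
    """
    settings = {
        "horizontal_strategy": "text",  # infer rows from text alignment
        "snap_tolerance": snap_tolerance,
    }
    if column_edges and len(column_edges) >= 2:
        # Use only the boundaries supplied by the caller.
        settings["vertical_strategy"] = "explicit"
        settings["explicit_vertical_lines"] = sorted(column_edges)
    else:
        # Fall back to inferring columns from text alignment as well.
        settings["vertical_strategy"] = "text"
    return settings


def extract_rows(pdf_path, page_number=0, column_edges=None):
    """Return the first table on a page as a list of rows (possibly empty)."""
    import pdfplumber  # imported here so the helper above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        table = page.extract_table(table_settings=make_table_settings(column_edges))
        return table or []
```

For example, extract_rows('test.pdf', column_edges=[40, 180, 320, 560]) would split each row at those four x-positions, while extract_rows('test.pdf') relies on text alignment alone.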

Upvotes: 2

Swapnal Shahil

Reputation: 84

You can load the extracted table into a pandas.DataFrame and clean it up there instead of printing the raw list directly.

table = tables[0]  # extract_tables() returns a list of tables; take the first
df = pd.DataFrame(table[1:], columns=table[0])
for column in df.columns.tolist():
    df[column] = df[column].str.replace(" ", "")

print(df)
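Extracted cells can contain None, embedded newlines, or stray whitespace, so it may help to normalise the rows before building the DataFrame. This is a minimal sketch in plain Python; clean_table is a hypothetical helper and the sample data is invented for illustration, not taken from the PDF in the question.

```python
import re


def clean_table(table):
    """Normalise cell text and drop rows that are completely empty."""
    cleaned = []
    for row in table or []:
        # Treat None as empty and collapse runs of whitespace or newlines.
        cells = [re.sub(r"\s+", " ", cell).strip() if cell else "" for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


# Invented sample shaped like the output of page.extract_table()
raw = [["Name", "Amount"], [None, ""], ["Rent\npaid", " 1 200 "]]
header, *rows = clean_table(raw)
print(header)  # ['Name', 'Amount']
print(rows)    # [['Rent paid', '1 200']]
```

The cleaned header and rows can then be passed to pd.DataFrame(rows, columns=header).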

Upvotes: 0
