Tati

Reputation: 1

How to force pdfplumber to extract table according to the number of columns in the upper row?

I am trying to extract a table from a PDF document with the Python package pdfplumber. The table has four columns and multiple rows. The first row contains the headers and the second row consists of a single merged cell; after that, the values are laid out normally (example). pdfplumber was able to retrieve the table, but it produced six columns instead of four and did not place the values in the correct columns.

Table as shown in the PDF document: [image]

I tried various table settings, including "vertical_strategy": "lines", but they all give the same result.

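# Python 2.7.16
import pandas as pd
import pdfplumber

path = 'file_path'
pdf = pdfplumber.open(path)
# pdf.pages is zero-indexed, so index 7 is the eighth page
first_page = pdf.pages[7]
# extract_table() returns the largest table on the page as a list of rows
df5 = pd.DataFrame(first_page.extract_table())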

I get six columns instead of four, with values in the wrong columns. Output example:

Table as output in Jupyter Notebook: [image]

I would be grateful for any suggestions or solutions.

Upvotes: 0

Views: 4472

Answers (1)

tehem

Reputation: 85

This is not exactly what you're looking for, but you could load the output into a DataFrame and iterate over it, using the non-null values in the first row as column names for a new DataFrame. After that it is straightforward: collate all the data that falls between two consecutive header columns in the extracted DataFrame, merge those cells, and insert the result into the new DataFrame.
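A rough sketch of that idea, working directly on the list of rows that `extract_table()` returns. The function name `merge_columns_by_header`, the `sep` parameter, and the sample rows are made up for illustration, since I can't see your actual output. It also assumes that every value ends up somewhere between its own header and the next one; if values drift to the left of their header, the grouping would need adjusting.

```python
def merge_columns_by_header(rows, sep=" "):
    """Collapse surplus columns so that each header cell owns one output column.

    `rows` is a list of lists (first row = headers); empty cells are None or "".
    Returns (header_names, merged_body_rows).
    """
    header, body = rows[0], rows[1:]
    # Positions of the real headers: the non-empty cells in the first row.
    starts = [i for i, cell in enumerate(header) if cell not in (None, "")]
    if not starts:
        raise ValueError("first row contains no header cells")
    # Each header owns the columns from its position up to the next header.
    # Any columns before the first header are folded into the first group.
    bounds = [0] + starts[1:] + [len(header)]
    names = [header[i] for i in starts]

    merged = []
    for row in body:
        cells = []
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            # Join whatever non-empty fragments landed in this group of columns.
            parts = [str(c).strip() for c in row[lo:hi] if c not in (None, "")]
            cells.append(sep.join(parts) if parts else None)
        merged.append(cells)
    return names, merged


# Hypothetical six-column output for a table that really has four columns.
sample_rows = [
    ["Name", None, "Qty", "Price", None, "Total"],
    ["Section one", None, None, None, None, None],
    ["Widget", None, "2", None, "1.50", "3.00"],
]
names, merged = merge_columns_by_header(sample_rows)
```

The result can then be loaded with `pd.DataFrame(merged, columns=names)`. The merged row from the PDF ends up with its text in the first column and nulls in the rest, so it is easy to filter out or treat as a section label afterwards.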

Upvotes: 0
