How to force pdfplumber to extract table according to the number of columns in the upper row?

Question

I am trying to extract a table from a PDF document with the Python package pdfplumber. The table has four columns and multiple rows. The first row contains the headers, the second row consists of a single merged cell, and the remaining rows hold values as usual. pdfplumber was able to retrieve the table, but it produced six columns instead of four and did not place the values in the correct columns.

Table as shown in PDF document

I tried various table settings, including "vertical_strategy": "lines", but this gives me the same result.

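# Python 2.7.16
import pandas as pd
import pdfplumber

path = 'file_path'
pdf = pdfplumber.open(path)
# pages is zero-indexed, so index 7 is the eighth page of the document
page = pdf.pages[7]
# extract_table() returns a list of rows, each row a list of cell strings
df5 = pd.DataFrame(page.extract_table())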

I get six columns instead of four, with values in the wrong columns. Output example:

Table as output in jupyter notebooks

I would be happy to hear any suggestions or solutions.


Answers (1)
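One thing that may be worth trying is to stop pdfplumber from guessing the column borders and pass them in yourself. pdfplumber's table settings accept "vertical_strategy": "explicit" together with "explicit_vertical_lines", a list of x-coordinates. The sketch below is only an illustration under assumptions: the helper names explicit_column_settings and extract_with_fixed_columns are made up for this example, and the edge coordinates are placeholders that would have to be replaced with the real positions of the header borders in your PDF. Whether it fixes the merged-row problem depends on the document.

```python
def explicit_column_settings(column_edges, horizontal_strategy="lines"):
    """Build pdfplumber table settings that pin the vertical cell borders.

    column_edges: x-coordinates (in PDF points) of the vertical borders,
    in any order. N columns need N + 1 edges, so four columns need five.
    (Hypothetical helper, not part of pdfplumber.)
    """
    edges = sorted(float(x) for x in column_edges)
    if len(edges) < 2:
        raise ValueError("need at least two edges to define one column")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": edges,
        "horizontal_strategy": horizontal_strategy,
    }


def extract_with_fixed_columns(path, page_index, column_edges):
    """Extract one table from a page using the fixed column borders."""
    # Imported here so the settings helper can be used without pdfplumber.
    import pdfplumber

    settings = explicit_column_settings(column_edges)
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table(table_settings=settings)


# Placeholder coordinates for a four-column table (assumed values).
settings = explicit_column_settings([50, 180, 310, 440, 570])
print(settings["explicit_vertical_lines"])
```

With real coordinates, the call would look like rows = extract_with_fixed_columns('file_path', 7, [50, 180, 310, 440, 570]), and the result can be passed to pd.DataFrame as in the question. To find the coordinates, the x0 and x1 values of the header words returned by page.extract_words() are a reasonable starting point.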
