Reputation: 471

Using tabula.py to read table without header from PDF format

I have a PDF file with tables in it and would like to read it into dataframes using tabula. However, only the first page of the PDF has a column header. For the dataframes after page 1, the first row of data is taken as the header instead. Is there any way I can apply the header from the page 1 dataframe to the rest of the dataframes? Thanks in advance. Much appreciated!

Upvotes: 2

Answers (1)

Kathan Thakkar

Reputation: 196

You can solve this with the following steps:

Read the PDF:

import tabula

tables = tabula.read_pdf(filename, pages='all', pandas_options={'header': None})

This creates a list of dataframes, one per table that tabula finds, which here means one per page.

pandas_options={'header': None} tells tabula not to use the first row of each table as the dataframe header.

As a result, the header of the first page ends up as the first row of the first dataframe in the tables list.
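To see the effect of header=None without needing a PDF, here is a small sketch that uses pandas.read_csv on an in-memory string as a stand-in for a page extracted by tabula. The sample text and column names are made up for illustration; only the header=None behaviour is the point.

```python
import io

import pandas as pd

# Hypothetical stand-in for page 1: a header line followed by two data rows.
page_text = "name,qty\napple,3\npear,5\n"

# With header=None, pandas does not promote the first row to column names.
df = pd.read_csv(io.StringIO(page_text), header=None)

# The columns get default integer labels, and the header text sits in row 0.
first_row = df.iloc[0].tolist()
print(list(df.columns))
print(first_row)
```

The header text is therefore still available as ordinary data, which is what the next step relies on.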

Saving header in a variable:

cols = tables[0].values.tolist()[0]

This creates a list named cols containing the first row of the first dataframe in tables, which is our header.

Removing first row of first page:

tables[0] = tables[0].iloc[1:]

This line removes the first row of the first dataframe (page) in tables. We have already stored it in cols, so we no longer need it.

Giving header to all the pages:

for df in tables:
    df.columns = cols

This loop iterates over every dataframe (page) and assigns it the header stored in cols.

The header from the page 1 dataframe is now applied to the rest of the dataframes (pages).

You can also concatenate them into a single dataframe with

import pandas as pd

and:

df_Final = pd.concat(tables)
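Putting the steps together, here is a minimal sketch that works on any list of dataframes shaped like the output of the read_pdf call above. The helper name apply_first_page_header and the sample data are hypothetical, and the column-count check is an extra safeguard that is not part of the steps above. It uses in-memory dataframes so it runs without a PDF.

```python
import pandas as pd


def apply_first_page_header(tables):
    """Use row 0 of the first dataframe as the header of every dataframe."""
    if not tables:
        return pd.DataFrame()

    # The header is the first row of the first page.
    cols = tables[0].iloc[0].tolist()

    # Drop the header row from page 1; copy so the caller's dataframes are untouched.
    pages = [tables[0].iloc[1:].copy()] + [t.copy() for t in tables[1:]]

    for page in pages:
        # Assigning columns fails if a page has a different number of columns,
        # so raise a clearer error first.
        if page.shape[1] != len(cols):
            raise ValueError("page has a different number of columns than the header")
        page.columns = cols

    # ignore_index=True renumbers the rows 0..n-1 across all pages.
    return pd.concat(pages, ignore_index=True)


# Hypothetical data standing in for two pages read with header=None.
page1 = pd.DataFrame([["name", "qty"], ["apple", 3], ["pear", 5]])
page2 = pd.DataFrame([["plum", 7], ["fig", 2]])

combined = apply_first_page_header([page1, page2])
print(combined)
```

Because the header row was read as data, some columns may end up with object dtype. If you need numbers, converting the relevant columns afterwards with pd.to_numeric is one option.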

Hope this helps. Thanks for the opportunity.

Upvotes: 10
