Matthias Gallagher

Reputation: 471

Using tabula-py to read tables without a header from a PDF

I have a PDF file with tables in it and would like to read it into dataframes using tabula. But only the first PDF page has a column header, so for the dataframes after page 1 the first row of data is used as the header. Is there any way to add the header from the page 1 dataframe to the rest of the dataframes? Thanks in advance. Much appreciated!

Upvotes: 2

Views: 16171

Answers (1)

Kathan Thakkar

Reputation: 196

You can solve this with the following steps:

  1. Read the PDF:

    tables = tabula.read_pdf(filename, pages='all', pandas_options={'header': None})

This returns a list of dataframes, one for each table tabula finds (here, one per page).

pandas_options={'header': None} tells tabula not to treat the first row of each table as the header.

So the header from the first page ends up as the first row of the first dataframe in the tables list.

  2. Save the header in a variable:

    cols = tables[0].values.tolist()[0]

This creates a list named cols that holds the first row of the first dataframe in the tables list, which is our header.

  3. Remove the first row of the first page:

    tables[0] = tables[0].iloc[1:]

This line removes the first row of the first dataframe (page) in the tables list. The header is already stored in cols, so we no longer need it as a data row.

  4. Give the header to all the pages:

    for df in tables: df.columns = cols

This loop iterates over every dataframe (page) and assigns it the header stored in the cols variable.

As a result, the header from the page 1 dataframe is applied to the rest of the dataframes (pages).
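The four steps above can be wrapped in a small helper. This is only a minimal sketch: the function name apply_first_page_header is hypothetical, and it assumes every dataframe was read with header=None and has the same number of columns as the first page. It works on copies rather than changing the dataframes in place, and it raises an error if a page has a different column count instead of silently mislabelling it.

```python
import pandas as pd


def apply_first_page_header(tables):
    """Use row 0 of the first dataframe as the column names of every dataframe.

    Hypothetical helper; assumes all frames were read with header=None.
    """
    if not tables:
        return []

    # Step 2: the first row of the first page holds the header.
    cols = tables[0].iloc[0].tolist()

    # Step 3: drop that row from the first page and renumber its index.
    frames = [tables[0].iloc[1:].reset_index(drop=True)]
    # Copy the remaining pages so the caller's dataframes stay untouched.
    frames.extend(df.copy() for df in tables[1:])

    # Step 4: give every page the same header, checking the column count first.
    for i, df in enumerate(frames):
        if df.shape[1] != len(cols):
            raise ValueError(
                f"table {i} has {df.shape[1]} columns, expected {len(cols)}"
            )
        df.columns = cols
    return frames
```

With the list from step 1 it would be called as tables = apply_first_page_header(tables).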

You can also concatenate them into a single dataframe:

    import pandas as pd

    df_Final = pd.concat(tables)
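Note that pd.concat keeps each page's own row labels by default, so the index values repeat from one page to the next. If a single continuous index is preferred, ignore_index=True is one option. A short sketch, using two made-up dataframes as stand-ins for pages that already share a header:

```python
import pandas as pd

# Made-up stand-ins for two pages that already share the same header.
page1 = pd.DataFrame({"name": ["a"], "qty": [1]})
page2 = pd.DataFrame({"name": ["b", "c"], "qty": [2, 3]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping per-page labels.
df_final = pd.concat([page1, page2], ignore_index=True)
print(df_final.index.tolist())
```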

Hope this helps, and thanks for the opportunity.

Upvotes: 10
