Reputation: 471
I have a PDF file with tables in it and would like to read it into dataframes using tabula. Only the first PDF page has a column header, so for every dataframe after page 1 the first row of data ends up being used as the header. Is there any way to apply the header from the page 1 dataframe to the rest of the dataframes? Thanks in advance. Much appreciated!
Upvotes: 2
Views: 16171
Reputation: 196
You can solve this with the following steps.
Read the PDF:
import tabula
tables = tabula.read_pdf(filename, pages='all', pandas_options={'header': None})
This creates a list of dataframes, one for each table tabula finds (here, one per page).
pandas_options={'header': None} tells pandas not to treat the first row as the header.
So the header from the first page ends up as the first row of the first dataframe in the tables list.
Save the header in a variable:
cols = tables[0].values.tolist()[0]
This creates a list named cols holding the first row of the first dataframe in tables, which is our header.
Remove the first row of the first page:
tables[0] = tables[0].iloc[1:]
This line removes the first row of the first dataframe (page) in the tables list. We have already stored it in cols, so we no longer need it in the data.
Give the header to all the pages:
for df in tables:
    df.columns = cols
This loop iterates over every dataframe (page) and assigns it the header stored in cols.
So the header from the page 1 dataframe is applied to the rest of the dataframes (pages).
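As a rough end-to-end sketch of these steps, here are two small hand-built dataframes standing in for the list that tabula would return with header=None. The fruit data and the names page1 and page2 are made up for illustration, and no PDF is actually read:

```python
import pandas as pd

# Stand-ins for tabula.read_pdf(..., pandas_options={'header': None}):
# integer column labels everywhere, header text only in row 0 of page 1.
page1 = pd.DataFrame([["name", "qty"], ["apple", "3"], ["pear", "5"]])
page2 = pd.DataFrame([["plum", "2"], ["fig", "7"]])
tables = [page1, page2]

cols = tables[0].iloc[0].tolist()  # first row of page 1 is the header
tables[0] = tables[0].iloc[1:]     # drop that row from the data

for df in tables:
    df.columns = cols              # same labels on every page

print(tables[0])
print(tables[1])
```

Here tables[0].iloc[0].tolist() is just another way of writing tables[0].values.tolist()[0]; both return the first row as a plain list.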
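You can also concatenate all the pages into one dataframe. Import pandas:
import pandas as pd
and then:
df_Final = pd.concat(tables, ignore_index=True)
Passing ignore_index=True gives the combined dataframe a fresh 0 to n-1 index instead of repeating each page's own row numbers.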
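One thing to watch for: assigning cols raises a ValueError if a table has a different number of columns than the header, which can happen when tabula splits a page differently. Below is a minimal sketch of a helper that checks for this before combining. The name combine_pages and the sample data are hypothetical, not part of tabula or pandas:

```python
import pandas as pd

def combine_pages(tables, cols):
    """Label every table with `cols` and stack them into one dataframe."""
    labelled = []
    for i, df in enumerate(tables):
        if df.shape[1] != len(cols):
            # Fail with a message that says which table is the odd one out.
            raise ValueError(
                f"table {i} has {df.shape[1]} columns, expected {len(cols)}"
            )
        # set_axis returns a relabelled dataframe and leaves the input untouched.
        labelled.append(df.set_axis(cols, axis=1))
    # ignore_index=True renumbers the rows of the combined dataframe from 0.
    return pd.concat(labelled, ignore_index=True)

# Made-up tables: the first one has already had its header row removed.
demo = [
    pd.DataFrame([["apple", "3"], ["pear", "5"]], index=[1, 2]),
    pd.DataFrame([["plum", "2"]]),
]
df_final = combine_pages(demo, ["name", "qty"])
print(df_final)
```

With real data you would pass the tables list and cols from the steps above in place of the demo values.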
Hope this helps, and thanks for the opportunity.
Upvotes: 10