Reputation: 105
I have a list of dataframes. These are tables that have been extracted from a pdf file. The problem that I am facing is that two of the tables that have been extracted are a continuation of each other. Please see example below:
import pandas as pd
df1 = {'0':['Category','Home Services','Mobile','Internet'],'1':['Number',1,1,2]}
df2 = {'0':['Job Title','Associate Partner','Developer 1','Developer 2'],'1':['Daily Rate',1000,1000,1000]}
df3 = {'0':['UI Developer','UX Designer','Delivery Manager'],'1':[1000,1000,1000]}
df4 = {'0':['Name','John','Sara','Mike'],'1':['Age',25,30,40]}
Here df3 is a continuation of df2. Next I put them in a list. The library that I used to extract the tables is camelot and it returns a list of dataframe. So I reproduced the final result below:
my_df1 = pd.DataFrame(df1)
my_df2 = pd.DataFrame(df2)
my_df3 = pd.DataFrame(df3)
my_df4 = pd.DataFrame(df4)
my_df_list = [my_df1,my_df2,my_df3,my_df4]
There can be n number of tables extracted from the pdf file. How do I merge the tables into one that are a continuation of the other? In my example df2 and df3. Is there a library that can do this? or some way that this can be achieved? Your help is generously appreciated.
Cheers, Abhishek
Upvotes: 0
Views: 2266
Reputation: 481
We are on the same boat
You can try this
# tabula-py
# camelot
# PyPDF2
# pdf2text
# pdftotext
import tabula as tb
import camelot as cm
import pandas as pd
file = "Oct_11.pdf"
# tabula
tb.convert_into(file, file.split('.')[0]+'_tabula.csv', pages="all")
# camelot
tables = cm.read_pdf(file, pages='all')
df = pd.concat([tab.df for tab in tables], ignore_index=True)
df.to_csv(file.split('.')[0]+'_camelot.csv')
# PyPDF2
# pending
# pdf2text
# pending
# pdf2text
# pending
print('Program finished!')
If you need help regarding extracting tables, I will update my answer in future.
As of now you only need
df = pd.concat([tab.df for tab in tables], ignore_index=True)
Don't forget to change the file name!
Upvotes: 0
Reputation: 13387
IIUC:
df=pd.concat([pd.DataFrame(el) for el in [df1, df2, df3, df4]], axis=1)
df.columns=df.loc[0]
df=df.loc[1:]
Outputs:
0 Category Number ... Name Age
1 Home Services 1 ... John 25
2 Mobile 1 ... Sara 30
3 Internet 2 ... Mike 40
Upvotes: 1
Reputation: 3155
You said that "The problem that I am facing is that two of the tables that have been extracted are a continuation of each other." So if I understand your question correctly, you can use the pandas
concat()
function to concatenate two dataframes.
df = pd.concat([my_df2,my_df3])
print(df)
Output:
0 1
0 Job Title Daily Rate
1 Associate Partner 1000
2 Developer 1 1000
3 Developer 2 1000
0 UI Developer 1000
1 UX Designer 1000
2 Delivery Manager 1000
Upvotes: 1