Reputation: 105

Merging two or more continuous tables from a list in Python

I have a list of dataframes. These are tables that have been extracted from a pdf file. The problem that I am facing is that two of the tables that have been extracted are a continuation of each other. Please see example below:

import pandas as pd

df1 = {'0':['Category','Home Services','Mobile','Internet'],'1':['Number',1,1,2]}
df2 = {'0':['Job Title','Associate Partner','Developer 1','Developer 2'],'1':['Daily Rate',1000,1000,1000]}
df3 = {'0':['UI Developer','UX Designer','Delivery Manager'],'1':[1000,1000,1000]}
df4 = {'0':['Name','John','Sara','Mike'],'1':['Age',25,30,40]}

Here df3 is a continuation of df2. Next I put them in a list. The library that I used to extract the tables is camelot and it returns a list of dataframe. So I reproduced the final result below:

my_df1 = pd.DataFrame(df1)
my_df2 = pd.DataFrame(df2)
my_df3 = pd.DataFrame(df3)
my_df4 = pd.DataFrame(df4)
my_df_list = [my_df1,my_df2,my_df3,my_df4]

There can be n number of tables extracted from the pdf file. How do I merge the tables into one that are a continuation of the other? In my example df2 and df3. Is there a library that can do this? or some way that this can be achieved? Your help is generously appreciated.

Cheers, Abhishek

Upvotes: 0

Answers (3)

rish_hyun

Reputation: 481

We are on the same boat
You can try this

# tabula-py
# camelot
# PyPDF2
# pdf2text
# pdftotext

import tabula as tb
import camelot as cm
import pandas as pd

file = "Oct_11.pdf"

# tabula
tb.convert_into(file, file.split('.')[0]+'_tabula.csv', pages="all")

# camelot
tables = cm.read_pdf(file, pages='all')
df = pd.concat([tab.df for tab in tables], ignore_index=True)
df.to_csv(file.split('.')[0]+'_camelot.csv')

# PyPDF2
# pending

# pdf2text
# pending

# pdf2text
# pending

print('Program finished!')

If you need help regarding extracting tables, I will update my answer in future.
As of now you only need
df = pd.concat([tab.df for tab in tables], ignore_index=True)
Don't forget to change the file name!

Upvotes: 0

Georgina Skibinski

Reputation: 13387

IIUC:

df=pd.concat([pd.DataFrame(el) for el in [df1, df2, df3, df4]], axis=1)
df.columns=df.loc[0]
df=df.loc[1:]

Outputs:

0       Category Number  ...  Name Age
1  Home Services      1  ...  John  25
2         Mobile      1  ...  Sara  30
3       Internet      2  ...  Mike  40

Upvotes: 1

gmdev

Reputation: 3155

You said that "The problem that I am facing is that two of the tables that have been extracted are a continuation of each other." So if I understand your question correctly, you can use the pandas concat() function to concatenate two dataframes.

df = pd.concat([my_df2,my_df3])
print(df)

Output:

                   0           1
0          Job Title  Daily Rate
1  Associate Partner        1000
2        Developer 1        1000
3        Developer 2        1000
0       UI Developer        1000
1        UX Designer        1000
2   Delivery Manager        1000

Upvotes: 1

Merging two or more continuous tables from a list in Python

Answers (3)

Related Questions