Reputation: 371
I have the following datasets from two different sources, i.e. Oracle and MySQL:
DF1 (Oracle):
   A      B      C
       1122   8827
 822   8282   6622
 727     72   1183
         91   5092
 992    113   7281
DF2 (MySQL):
   E      F      G
       8827   6363
 822   5526   9393
 727    928   6671
       9221   7282
 992    921   7262
        445   6298
I need to join these in pandas such that the result below is obtained.
Expected output:
   A      B      C      F      G
 822   8282   6622   5526   9393
 727     72   1183    928   6671
 992    113   7281    921   7262
       1122   8827
         91   5092
                     8827   6363
                      445   6298
Update_1:
As suggested, I tried the following:
import pandas as pd
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print(df1)
data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print(df2)
DF11 = df1.set_index(df1['A'].fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
I am getting the following:
     A       B       C       F       G
0  727    72.0  1183.0   928.0  6671.0
1  822  8282.0  6622.0  5526.0  9393.0
2  992   113.0  7281.0   921.0  7262.0
3       1122.0  8827.0  8827.0  6363.0
4       1122.0  8827.0  9221.0  7282.0
5       1122.0  8827.0   445.0  6298.0
6         91.0  5092.0  8827.0  6363.0
7         91.0  5092.0  9221.0  7282.0
8         91.0  5092.0   445.0  6298.0
Q: How do I avoid the repetition of values and get the expected output?
Upvotes: 0
Views: 53
Reputation: 153460
Your problem is complicated by nulls in the join key. You can try some logic like this to achieve your result, or create a different join key that doesn't contain nulls.
DF11 = DF1.set_index(DF1['A'].fillna(DF1.groupby('A').cumcount().astype(str)+'A'))
DF22 = DF2.set_index(DF2['E'].fillna(DF2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
Output:
       A       B       C       F       G
0    NaN  1122.0  8827.0     NaN     NaN
1  822.0  8282.0  6622.0  5526.0  9393.0
2  727.0    72.0  1183.0   928.0  6671.0
3    NaN    91.0  5092.0     NaN     NaN
4  992.0   113.0  7281.0   921.0  7262.0
5    NaN     NaN     NaN  8827.0  6363.0
6    NaN     NaN     NaN  9221.0  7282.0
7    NaN     NaN     NaN   445.0  6298.0
Update: because your data has blanks ('') rather than np.nan, I had to add a call in those statements to replace '' with np.nan so that fillna works correctly.
df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))
Try this:
import pandas as pd
import numpy as np
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print(df1)
data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print(df2)
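# fill blank keys with unique placeholders built from groupby-cumcount
# ('0A', '1A', ... in df1; '0E', '1E', '2E' in df2), so unmatched rows keep
# their own index value and cannot cross-join in the outer merge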
DF11 = df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].replace('',np.nan).fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
Upvotes: 2
Reputation: 11
Question: for your desired output, did you intentionally leave out column E?
If not...
I'm not sure whether or not the dataframes coming from different sources would have any bearing on how they would be joined together.
import pandas as pd
...
frames = [DF1, DF2]
result = pd.concat(frames)
This should perform the join you want to accomplish.
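For reference, here is a minimal runnable sketch of this approach, assuming df1 and df2 are built as in the question's Update_1 (the result variable name is just for illustration). By default, pd.concat stacks the rows of the two frames and aligns them on column names, filling columns that a frame does not have with NaN:
import pandas as pd

# df1 has columns A, B, C; df2 has columns E, F, G (built as in the question)
frames = [df1, df2]
result = pd.concat(frames, ignore_index=True)  # row-wise stack; missing columns become NaN
print(result)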
Upvotes: 0