Alpha001
Alpha001

Reputation: 371

Pandas: Joining dataframes from different sources

Have the following datasets from two different sources i.e. Oracle and MySQL:

DF1 (Oracle):

A   B       C
    1122    8827
822 8282    6622
727 72      1183
    91      5092
992 113     7281

DF2 (MySQL):

E   F       G
    8827    6363
822 5526    9393
727 928     6671
    9221    7282
992 921     7262
    445     6298

Need to join these in pandas such that the below result is obtained.

Expected o/p:

A   B       C       F       G   

822 8282    6622    5526    9393
727 72      1183    928     6671
992 113     7281    921     7262
    1122    8827    
    91      5092
                    8827    6363
                    445     6298

Update_1:

As suggested, tried the following:

import pandas as pd
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print df1

data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print df2


DF11 = df1.set_index(df1['A'].fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
    .reset_index(drop=True)\
    .drop('E', axis=1)

getting the following:

    A   B       C       F       G
0   727 72.0    1183.0  928.0   6671.0
1   822 8282.0  6622.0  5526.0  9393.0
2   992 113.0   7281.0  921.0   7262.0
3       1122.0  8827.0  8827.0  6363.0
4       1122.0  8827.0  9221.0  7282.0
5       1122.0  8827.0  445.0   6298.0
6       91.0    5092.0  8827.0  6363.0
7       91.0    5092.0  9221.0  7282.0
8       91.0    5092.0  445.0   6298.0

Q: How to avoid the repetition of values and get the expected o/p?

Upvotes: 0

Views: 53

Answers (2)

Scott Boston
Scott Boston

Reputation: 153460

Your problem is complicated by nulls in the join key. You try some logic like this to achieve your result, or create a different key for joins that doesn't have nulls.

DF11 = DF1.set_index(DF1['A'].fillna(DF1.groupby('A').cumcount().astype(str)+'A'))
DF22 = DF2.set_index(DF2['E'].fillna(DF2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
    .reset_index(drop=True)\
    .drop('E', axis=1)

Output:

       A       B       C       F       G
0    NaN  1122.0  8827.0     NaN     NaN
1  822.0  8282.0  6622.0  5526.0  9393.0
2  727.0    72.0  1183.0   928.0  6671.0
3    NaN    91.0  5092.0     NaN     NaN
4  992.0   113.0  7281.0   921.0  7262.0
5    NaN     NaN     NaN  8827.0  6363.0
6    NaN     NaN     NaN  9221.0  7282.0
7    NaN     NaN     NaN   445.0  6298.0

Update, due the fact your data has blanks and not np.nan, I had to add a method in those statement to replace '' with np.nan to get fillna to work correctly.

df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))

Try this:

import pandas as pd
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print(df1)

data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print(df2)


DF11 = df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].replace('',np.nan).fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
    .reset_index(drop=True)\
    .drop('E', axis=1)

Upvotes: 2

Zachary
Zachary

Reputation: 11

Question, for your desired output, did you intentionally leave out column E?

if not...

I'm not sure whether or not the dataframes coming from different sources would have any bearing on how they would be joined together.

import pandas as pd
...
frames = [DF1, DF2]
result = pd.concat(frames)

This should perform the join you want to accomplish.

Upvotes: 0

Related Questions