Reputation: 371
I have the following datasets from two different sources, i.e. Oracle and MySQL:
DF1 (Oracle):
   A      B      C
       1122   8827
 822   8282   6622
 727     72   1183
         91   5092
 992    113   7281
DF2 (MySQL):
   E      F      G
       8827   6363
 822   5526   9393
 727    928   6671
       9221   7282
 992    921   7262
        445   6298
I need to join these in pandas such that the result below is obtained.
Expected output:
   A      B      C      F      G
 822   8282   6622   5526   9393
 727     72   1183    928   6671
 992    113   7281    921   7262
       1122   8827
         91   5092
                     8827   6363
                      445   6298
Update_1:
As suggested, I tried the following:
import pandas as pd
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print(df1)
data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print(df2)
DF11 = df1.set_index(df1['A'].fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
I am getting the following:
     A       B       C       F       G
0  727    72.0  1183.0   928.0  6671.0
1  822  8282.0  6622.0  5526.0  9393.0
2  992   113.0  7281.0   921.0  7262.0
3       1122.0  8827.0  8827.0  6363.0
4       1122.0  8827.0  9221.0  7282.0
5       1122.0  8827.0   445.0  6298.0
6         91.0  5092.0  8827.0  6363.0
7         91.0  5092.0  9221.0  7282.0
8         91.0  5092.0   445.0  6298.0
Q: How do I avoid the repetition of values and get the expected output?
Upvotes: 0
Views: 53
Reputation: 153460
Your problem is complicated by nulls in the join key. You can try some logic like this to achieve your result, or create a different join key that doesn't contain nulls.
DF11 = DF1.set_index(DF1['A'].fillna(DF1.groupby('A').cumcount().astype(str)+'A'))
DF22 = DF2.set_index(DF2['E'].fillna(DF2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
Output:
       A       B       C       F       G
0    NaN  1122.0  8827.0     NaN     NaN
1  822.0  8282.0  6622.0  5526.0  9393.0
2  727.0    72.0  1183.0   928.0  6671.0
3    NaN    91.0  5092.0     NaN     NaN
4  992.0   113.0  7281.0   921.0  7262.0
5    NaN     NaN     NaN  8827.0  6363.0
6    NaN     NaN     NaN  9221.0  7282.0
7    NaN     NaN     NaN   445.0  6298.0
Update: because your data has blanks ('') rather than np.nan, I had to add a call in those statements to replace '' with np.nan so that fillna works correctly.
df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))
Try this:
import pandas as pd
import numpy as np
data1 = [['',1122,8827],[822,8282,6622],[727,72,1183],['',91,5092],[992,113,7281]]
df1 = pd.DataFrame(data1,columns=['A','B','C'],dtype=float)
print(df1)
data2 = [['',8827,6363],[822,5526,9393],[727,928,6671],['',9221,7282],[992,921,7262],['',445,6298]]
df2 = pd.DataFrame(data2,columns=['E','F','G'],dtype=float)
print(df2)
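# fill blank keys with unique placeholders built from groupby-cumcount
# ('0A', '1A', ... in df1; '0E', '1E', '2E' in df2), so unmatched rows keep
# their own index value and cannot cross-join in the outer merge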
DF11 = df1.set_index(df1['A'].replace('',np.nan).fillna(df1.groupby('A').cumcount().astype(str)+'A'))
DF22 = df2.set_index(df2['E'].replace('',np.nan).fillna(df2.groupby(['E']).cumcount().astype(str)+'E'))
DF11.merge(DF22, left_index=True, right_index=True, how='outer')\
.reset_index(drop=True)\
.drop('E', axis=1)
Upvotes: 2
Reputation: 11
Question: for your desired output, did you intentionally leave out column E?
If not...
I'm not sure whether or not the dataframes coming from different sources would have any bearing on how they would be joined together.
import pandas as pd
...
frames = [DF1, DF2]
result = pd.concat(frames)
This should perform the join you want to accomplish.
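For reference, here is a minimal runnable sketch of this approach, assuming df1 and df2 are built as in the question's Update_1 (the result variable name is just for illustration). By default, pd.concat stacks the rows of the two frames and aligns them on column names, filling columns that a frame does not have with NaN:
import pandas as pd

# df1 has columns A, B, C; df2 has columns E, F, G (built as in the question)
frames = [df1, df2]
result = pd.concat(frames, ignore_index=True)  # row-wise stack; missing columns become NaN
print(result)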
Upvotes: 0