Reputation: 115
I am building a database from several existing ones. After merging the information I need from them, I end up with rows that appear in both sources, for example:
2018-11-22 Iraq 13984.75 3000.0 NaN
2018-11-22 Iraq NaN NaN Heavy Rain
Desired output:
2018-11-22 Iraq 13984.75 3000.0 Heavy Rain
Now I want to merge these into a single row. Almost every value is present in only one of the rows, with NaN in its place in the other, so I would like to replace each NaN with the value from the other row. However, some values may be defined in both rows, such as the ending date in this case, and for those I would like to keep the larger one.
Is there a way to do this with pandas?
Upvotes: 1
Views: 2460
Reputation: 862661
I believe you need:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,4,50,5,np.nan],
'C':[7,np.nan,9,4,2,3],
'E':[np.nan,30,60,9,np.nan,4],
'F':['s','d','f',np.nan,'r',np.nan]
}, index=pd.date_range('2011-01-01', periods=6))
df2 = pd.DataFrame({
'A':list('ertyui'),
'B':[4,np.nan,6,5,5,8],
'C':[7,np.nan,9,20,2,3],
'E':[8,np.nan,3,6,90,np.nan],
'F':[np.nan,'d',np.nan,'f','r',np.nan]
}, index=pd.date_range('2011-01-01', periods=6))
First concat both DataFrames:
df = pd.concat([df1, df2])
print (df)
A B C E F
2011-01-01 a 4.0 7.0 NaN s
2011-01-02 b NaN NaN 30.0 d
2011-01-03 c 4.0 9.0 60.0 f
2011-01-04 d 50.0 4.0 9.0 NaN
2011-01-05 e 5.0 2.0 NaN r
2011-01-06 f NaN 3.0 4.0 NaN
2011-01-01 e 4.0 7.0 8.0 NaN
2011-01-02 r NaN NaN NaN d
2011-01-03 t 6.0 9.0 3.0 NaN
2011-01-04 y 5.0 20.0 6.0 f
2011-01-05 u 5.0 2.0 90.0 r
2011-01-06 i 8.0 3.0 NaN NaN
Then select only the numeric columns with select_dtypes and aggregate max per index value:
# group by index value; DataFrame.max(level=0) is not available in pandas 2.0+
df11 = df.select_dtypes(np.number).groupby(level=0).max()
print (df11)
B C E
2011-01-01 4.0 7.0 8.0
2011-01-02 NaN NaN 30.0
2011-01-03 6.0 9.0 60.0
2011-01-04 50.0 20.0 9.0
2011-01-05 5.0 2.0 90.0
2011-01-06 8.0 3.0 4.0
For string columns, aggregate with first, which returns the first non-NaN value per group:
df12 = df.select_dtypes(object).groupby(level=0).first()
print (df12)
A F
2011-01-01 a s
2011-01-02 b d
2011-01-03 c f
2011-01-04 d f
2011-01-05 e r
2011-01-06 f NaN
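To show why first fits here, a minimal sketch on a small made-up frame (the values and the name `s` are illustrative only, not part of the data above): GroupBy.first skips NaN by default, whereas taking the first row of each group with head(1) keeps it.

```python
import numpy as np
import pandas as pd

# Illustrative data: three rows sharing one index value, the first one missing
s = pd.DataFrame(
    {"F": [np.nan, "f", "g"]},
    index=pd.to_datetime(["2011-01-04"] * 3),
)
g = s.groupby(level=0)

first_valid = g.first()  # first non-NaN value per group
first_row = g.head(1)    # first row per group, NaN is kept

print(first_valid)
print(first_row)
```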
Finally, join the two results together and use reindex to restore the original column order:
out = pd.concat([df11, df12], axis=1).reindex(columns=df.columns)
print (out)
A B C E F
2011-01-01 a 4.0 7.0 8.0 s
2011-01-02 b NaN NaN 30.0 d
2011-01-03 c 6.0 9.0 60.0 f
2011-01-04 d 50.0 20.0 9.0 f
2011-01-05 e 5.0 2.0 90.0 r
2011-01-06 f 8.0 3.0 4.0 NaN
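The same idea can probably be written as a single groupby with one aggregation per column. Below is a rough sketch on the two rows from the question. The column names (`date`, `country`, `value_1`, `value_2`, `event`) and the choice of `country` as the key are assumptions, since the question does not show the real headers.

```python
import numpy as np
import pandas as pd

# The two rows from the question; column names are placeholders
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-11-22", "2018-11-22"]),
    "country": ["Iraq", "Iraq"],
    "value_1": [13984.75, np.nan],
    "value_2": [3000.0, np.nan],
    "event": [np.nan, "Heavy Rain"],
})

key = ["country"]  # assumed identifier of a duplicated record


def pick_agg(col):
    """Use max for numeric and datetime columns, first non-NaN otherwise."""
    if pd.api.types.is_numeric_dtype(col) or pd.api.types.is_datetime64_any_dtype(col):
        return "max"
    return "first"


agg = {name: pick_agg(df[name]) for name in df.columns if name not in key}

merged = df.groupby(key, as_index=False).agg(agg)
print(merged)
```

Both max and first ignore NaN within a group, so a value present in only one row is kept, and where both rows carry a date or a number the larger one wins.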
Upvotes: 4