Javi Rando

Reputation: 115

Merge information from rows with the same index into a single row with pandas

I am building a database from several existing ones. After merging the information I need from them, some rows appear twice, once from each source:

2018-11-22 Iraq 13984.75 3000.0 NaN
2018-11-22 Iraq NaN NaN Heavy Rain

Desired output:

2018-11-22 Iraq 13984.75 3000.0 Heavy Rain

Now I want to merge these rows into a single one. Almost every value, as the rows above show, is present in only one of the rows, and the other row has NaN in its place, so I would like to replace each NaN with the value from the other row. However, some values might be defined in both rows, such as the ending date in this case, and for those I would like to keep the larger one.

Is there a way to do this with pandas?

Upvotes: 1

Views: 2460

Answers (1)

jezrael

Reputation: 862661

I believe you need:

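import numpy as np
import pandas as pd

# Sample data: both frames share the same DatetimeIndex
df1 = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, 4, 50, 5, np.nan],
    'C': [7, np.nan, 9, 4, 2, 3],
    'E': [np.nan, 30, 60, 9, np.nan, 4],
    'F': ['s', 'd', 'f', np.nan, 'r', np.nan]
}, index=pd.date_range('2011-01-01', periods=6))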


df2 = pd.DataFrame({
    'A': list('ertyui'),
    'B': [4, np.nan, 6, 5, 5, 8],
    'C': [7, np.nan, 9, 20, 2, 3],
    'E': [8, np.nan, 3, 6, 90, np.nan],
    'F': [np.nan, 'd', np.nan, 'f', 'r', np.nan]
}, index=pd.date_range('2011-01-01', periods=6))

First, concatenate both DataFrames:

df = pd.concat([df1, df2])
print(df)
            A     B     C     E    F
2011-01-01  a   4.0   7.0   NaN    s
2011-01-02  b   NaN   NaN  30.0    d
2011-01-03  c   4.0   9.0  60.0    f
2011-01-04  d  50.0   4.0   9.0  NaN
2011-01-05  e   5.0   2.0   NaN    r
2011-01-06  f   NaN   3.0   4.0  NaN
2011-01-01  e   4.0   7.0   8.0  NaN
2011-01-02  r   NaN   NaN   NaN    d
2011-01-03  t   6.0   9.0   3.0  NaN
2011-01-04  y   5.0  20.0   6.0    f
2011-01-05  u   5.0   2.0  90.0    r
2011-01-06  i   8.0   3.0   NaN  NaN
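Before aggregating, it can help to check which index labels actually occur more than once. The following is only a small sketch on made-up data (the frame and the names dup_mask and n_dup_labels are my own, not from the question), using Index.duplicated:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: the label 2011-01-01 occurs twice
idx = pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-01"])
df = pd.DataFrame({"B": [4.0, np.nan, 6.0]}, index=idx)

# keep=False marks every row whose index label is repeated
dup_mask = df.index.duplicated(keep=False)
print(df[dup_mask])

# Number of distinct labels that have more than one row
n_dup_labels = int(df.index[dup_mask].nunique())
print(n_dup_labels)
```

If n_dup_labels is 0, there is nothing to merge and the aggregation steps can be skipped.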

Then select only the numeric columns with select_dtypes and aggregate the max per index value:

# DataFrame.max(level=...) was removed in pandas 2.0; group by the index level instead
df11 = df.select_dtypes(np.number).groupby(level=0).max()
print(df11)
               B     C     E
2011-01-01   4.0   7.0   8.0
2011-01-02   NaN   NaN  30.0
2011-01-03   6.0   9.0  60.0
2011-01-04  50.0  20.0   9.0
2011-01-05   5.0   2.0  90.0
2011-01-06   8.0   3.0   4.0

For the string columns, aggregate with first, which returns the first non-NaN value per group:

df12 = df.select_dtypes(object).groupby(level=0).first()
print(df12)
            A    F
2011-01-01  a    s
2011-01-02  b    d
2011-01-03  c    f
2011-01-04  d    f
2011-01-05  e    r
2011-01-06  f  NaN
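The reason first works here is that GroupBy.first skips missing values, unlike simply keeping the first row of each duplicate. A minimal sketch on made-up data (the frame s and both result names are assumptions for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up data: label "x" has NaN in its first row, label "y" in its second
s = pd.DataFrame(
    {"F": [np.nan, "d", "f", np.nan]},
    index=["x", "y", "x", "y"],
)

# GroupBy.first returns the first non-missing value in each group
first_valid = s.groupby(level=0).first()

# Keeping only the first row per label does NOT skip NaN
first_row = s[~s.index.duplicated(keep="first")]

print(first_valid)
print(first_row)
```

For label "x", first_valid holds "f" while first_row still holds NaN, which is why the groupby approach fills the gaps.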

Finally, join both results and use reindex to restore the original column order:

out = pd.concat([df11, df12], axis=1).reindex(columns=df.columns)
print(out)
            A     B     C     E    F
2011-01-01  a   4.0   7.0   8.0    s
2011-01-02  b   NaN   NaN  30.0    d
2011-01-03  c   6.0   9.0  60.0    f
2011-01-04  d  50.0  20.0   9.0    f
2011-01-05  e   5.0   2.0  90.0    r
2011-01-06  f   8.0   3.0   4.0  NaN
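The same idea can probably be written as one groupby with a per-column aggregation, which also covers date columns such as the ending date mentioned in the question (select_dtypes(np.number) does not pick up datetimes). This is only a sketch: the helper name merge_duplicate_index, the column names, and the end_date values are all assumptions, since the question does not show them.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def merge_duplicate_index(df):
    """Collapse rows sharing an index label (hypothetical helper).

    Numeric and datetime columns keep the largest value,
    all other columns keep the first non-missing value.
    """
    how = {}
    for col in df.columns:
        if is_numeric_dtype(df[col]) or is_datetime64_any_dtype(df[col]):
            how[col] = "max"
        else:
            how[col] = "first"
    return df.groupby(level=0).agg(how)


# The two rows from the question; column names and end_date are made up
rows = pd.DataFrame(
    {
        "country": ["Iraq", "Iraq"],
        "end_date": pd.to_datetime(["2018-11-25", "2018-11-27"]),
        "affected": [13984.75, np.nan],
        "displaced": [3000.0, np.nan],
        "cause": [np.nan, "Heavy Rain"],
    },
    index=pd.to_datetime(["2018-11-22", "2018-11-22"]),
)

merged = merge_duplicate_index(rows)
print(merged)
```

Because the aggregation dict is built in column order, no reindex is needed afterwards. Both max and first skip NaN, so a value present in only one of the rows survives.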

Upvotes: 4
