John
John

Reputation: 1273

pandas concat/merge and sum one column

I have two pandas.DataFrame objects with MultiIndex indices. Some of the index values are shared with the two dataframes, but not all. I would like to merge these two data frames and take the sum of one of the columns if the row (index value) exists. Otherwise, keep the row and column value as it exists.

: This is close, but does not use MultiIndex

I've tried to create an example:

def mklbl(prefix,n):
try:
    return ["%s%s" % (prefix,i)  for i in range(n)]
except:
    return ["%s%s" % (prefix,i) for i in n]



mi1 = pd.MultiIndex.from_product([mklbl('A',4), mklbl('C',2)])

mi2 = pd.MultiIndex.from_product([mklbl('A',[2,3,4]), mklbl('C',2)])

df2 = pd.DataFrame({'b':np.arange(len(mi2)), 'c':np.arange(len(mi2))[::-1]},
      index=mi2).sort_index().sort_index(axis=1)

df1 = pd.DataFrame({'a':np.arange(len(mi1)), 'b':np.arange(len(mi1))[::-1]},
      index=mi1).sort_index().sort_index(axis=1)

The individual DataFrame objects look like:

In [117]: df1
Out[117]: 
       a  b
A0 C0  0  7
   C1  1  6
A1 C0  2  5
   C1  3  4
A2 C0  4  3
   C1  5  2
A3 C0  6  1
   C1  7  0

and

In [118]: df2
Out[118]: 
       b  c
A2 C0  0  5
   C1  1  4
A3 C0  2  3
   C1  3  2
A4 C0  4  1
   C1  5  0

What I want to do is merge these two, and sum the 'b' column, but keep all rows whether they exist in one or the other dataframe:

In [117]: df_merged_bsummed
Out[117]: 
       a  b  c
A0 C0  0  7  NaN
   C1  1  6  NaN
A1 C0  2  5  NaN
   C1  3  4  NaN
A2 C0  4  3  5
   C1  5  3  4
A3 C0  6  3  3
   C1  7  3  2
A4 C0  NaN 4  1
   C1  NaN 5  0

Upvotes: 4

Views: 7547

Answers (1)

DSM
DSM

Reputation: 353059

In this particular case, I think you could just add them and use fill_value=0, relying on the default alignment behaviour:

>>> df1.add(df2,fill_value=0)
        a  b   c
A0 C0   0  7 NaN
   C1   1  6 NaN
A1 C0   2  5 NaN
   C1   3  4 NaN
A2 C0   4  3   5
   C1   5  3   4
A3 C0   6  3   3
   C1   7  3   2
A4 C0 NaN  4   1
   C1 NaN  5   0

There being only one column in common, only one is summed, but if you wanted to make that explicit you could instead do something like

>>> m = pd.concat([df1, df2],axis=1)
>>> m["b"] = m.pop("b").sum(axis=1)
>>> m
        a   c  b
A0 C0   0 NaN  7
   C1   1 NaN  6
A1 C0   2 NaN  5
   C1   3 NaN  4
A2 C0   4   5  3
   C1   5   4  3
A3 C0   6   3  3
   C1   7   2  3
A4 C0 NaN   1  4
   C1 NaN   0  5

Upvotes: 8

Related Questions