summarizing data frame in pandas - python

Question

df = pd.DataFrame({'a':['y',NaN,'y',NaN,NaN,'x','x','y',NaN],'b':[NaN,'x',NaN,'y','x',NaN,NaN,NaN,'y'],'d':[1,0,0,1,1,1,0,1,0]})

I'm trying to summarize this dataframe using sum. I thought df.groupby(['a','b']).aggregate(sum) would work but it returns an empty Series.

How can I achieve this result?

   a  b
x  1  1
y  2  1

unutbu · Accepted Answer

import numpy as np
import pandas as pd
NaN = np.nan

df = pd.DataFrame(
    {'a':['y',NaN,'y',NaN,NaN,'x','x','y',NaN],
     'b':[NaN,'x',NaN,'y','x',NaN,NaN,NaN,'y'],
     'd':[32,12,55,98,23,11,9,91,3]})

melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b'])
result = pd.pivot_table(melted, values='d', index=['value'], columns=['variable'], 
                        aggfunc=np.median)
print(result)

yields

variable     a     b
value               
x         10.0  17.5
y         55.0  50.5

Explanation:

Melting the DataFrame with melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b']) produces

     d variable value
0   32        a     y
1   12        a   NaN
2   55        a     y
3   98        a   NaN
4   23        a   NaN
5   11        a     x
6    9        a     x
7   91        a     y
8    3        a   NaN
9   32        b   NaN
10  12        b     x
11  55        b   NaN
12  98        b     y
13  23        b     x
14  11        b   NaN
15   9        b   NaN
16  91        b   NaN
17   3        b     y

and now we can use pd.pivot_table to pivot and aggregate the d values:

result = pd.pivot_table(melted, values='d', index=['value'], columns=['variable'], 
                        aggfunc=np.median)
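
For the sums asked for in the question, the same melt-then-pivot approach should work with the question's original d values and a sum aggregation. This is a minimal sketch, assuming a reasonably recent pandas in which aggfunc accepts the string name 'sum'; the variable name summed is only illustrative:

```python
import numpy as np
import pandas as pd

NaN = np.nan

# The question's original data (d holds 0/1 values).
df = pd.DataFrame(
    {'a': ['y', NaN, 'y', NaN, NaN, 'x', 'x', 'y', NaN],
     'b': [NaN, 'x', NaN, 'y', 'x', NaN, NaN, NaN, 'y'],
     'd': [1, 0, 0, 1, 1, 1, 0, 1, 0]})

# Stack columns a and b into (variable, value) pairs, keeping d alongside.
melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b'])

# Sum d for each (value, variable) combination; NaN values are dropped.
summed = pd.pivot_table(melted, values='d', index='value',
                        columns='variable', aggfunc='sum')
print(summed)
# Column a: x -> 1, y -> 2; column b: x -> 1, y -> 1
```

Rows whose value is NaN are left out of the pivot. The same default explains the empty result from df.groupby(['a','b']): every row has NaN in either a or b, and groupby drops rows with a NaN key.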

Note that aggfunc also accepts a list of functions, such as [np.sum, np.median, np.min, np.max, np.std], if you wish to summarize the data in more than one way.
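
As a rough illustration of the list form, the sketch below computes both the sum and the median in one call. It uses the string names 'sum' and 'median' instead of np.sum and np.median; that substitution is an assumption on my part, made because recent pandas releases may emit a FutureWarning when NumPy callables are passed. The name multi is only illustrative.

```python
import numpy as np
import pandas as pd

NaN = np.nan

df = pd.DataFrame(
    {'a': ['y', NaN, 'y', NaN, NaN, 'x', 'x', 'y', NaN],
     'b': [NaN, 'x', NaN, 'y', 'x', NaN, NaN, NaN, 'y'],
     'd': [32, 12, 55, 98, 23, 11, 9, 91, 3]})

melted = pd.melt(df, id_vars=['d'], value_vars=['a', 'b'])

# Passing a list of aggregations yields one block of columns per function.
multi = pd.pivot_table(melted, values='d', index='value',
                       columns='variable', aggfunc=['sum', 'median'])
print(multi)

# The columns are two-level: function name on top, then a / b.
# Selecting the top level gives a plain table for that one function.
print(multi['sum'])
```

Selecting multi['median'] should reproduce the table shown earlier in this answer.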
