Frank Zhang
Frank Zhang

Reputation: 1688

How to calculate the average while containg np.nan using np.average? (While using groupby)

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C']=[5,2,np.nan,4,np.nan]
df['index']=df.index

The DataFrame:

  A  B   C   index
0 0  1   5     0
1 2  3   2     1
2 4  5   NaN   2
3 6  7   4     3
4 8  9   NaN   4

I want to use groupby funciton to calculate the average of "A""B""C" for every index.

df1 = df.groupby('index')
average = df1.apply(lambda df,x,y :np.average(df.iloc[:,x:y]),0,3)

The average:

index
0     2.00
1     2.33
2     NaN
3     5.67
4     NaN
dtype:float64

what I want to do is that :

  1. when A B C are digits,the average = (A+B+C)/3

  2. when one of ABC is np.nan, the average = sum/2

Upvotes: 1

Views: 173

Answers (2)

Sebastian Wozny
Sebastian Wozny

Reputation: 17506

Why use numpy instead of pandas.mean()?

df
Out[207]: 
   A  B    C
0  0  1  5.0
1  2  3  2.0
2  4  5  NaN
3  6  7  4.0
4  8  9  NaN

In [210]: df.mean(axis=1)
Out[210]: 
0    2.000000
1    2.333333
2    4.500000
3    5.666667
4    8.500000
dtype: float64

It does just what you need, and also works on groupby objects.

Upvotes: 0

jezrael
jezrael

Reputation: 862681

You can use custom function:

You can use GroupBy.mean - NaN are omited:

#groupby by index
df1 = df.groupby(level=0)
#groupby by column index
#df1 = df.groupby('index')

#filter columns for apply first, item() is for return scalar
average = df1['A','B','C'].apply(lambda x: x.mean(axis=1).item())
print (average)
0    2.000000
1    2.333333
2    4.500000
3    5.666667
4    8.500000
dtype: float64

Or with idea from Paul Panzer comment:

df1 = df.groupby(level=0)
average = df1['A','B','C'].apply(np.nanmean)
print (average)
0    2.000000
1    2.333333
2    4.500000
3    5.666667
4    8.500000
dtype: float64

Also groupby is not necessary, only select desired columns and use DataFrame.mean with axis=1:

df['new'] = df[['A','B','C']].mean(axis=1)
print (df)
   A  B    C  index       new
0  0  1  5.0      0  2.000000
1  2  3  2.0      1  2.333333
2  4  5  NaN      2  4.500000
3  6  7  4.0      3  5.666667
4  8  9  NaN      4  8.500000

print (df)
   A  B    C
0  0  1  5.0
1  2  3  2.0
2  4  5  NaN
3  6  7  4.0
4  8  9  NaN

#if need mean of all columns
df['new'] = df.mean(axis=1)
print (df)
   A  B    C       new
0  0  1  5.0  2.000000
1  2  3  2.0  2.333333
2  4  5  NaN  4.500000
3  6  7  4.0  5.666667
4  8  9  NaN  8.500000

Upvotes: 1

Related Questions