Reputation: 1227
I have a dataframe with client IDs and their expenses for 2014-2018. I want the mean of each ID's expenses over 2014-2018. There is one condition, however: if any of the 2014-2018 cells in a row is empty, NaN should be returned. So I only want the mean calculated when all 5 cells in the 2014-2018 columns of that row have a numeric value.
Initial dataframe:
2014 2015 2016 2017 2018 ID
100 122.0 324 632 NaN 12.0
120 159.0 54 452 541.0 96.0
NaN 164.0 687 165 245.0 20.0
180 421.0 512 184 953.0 73.0
110 654.0 913 173 103.0 84.0
130 NaN 754 124 207.0 26.0
170 256.0 843 97 806.0 87.0
140 754.0 95 101 541.0 64.0
80 985.0 184 84 90.0 11.0
96 65.0 127 130 421.0 34.0
Desired output
2014 2015 2016 2017 2018 ID mean
100 122.0 324 632 NaN 12.0 NaN
120 159.0 54 452 541.0 96.0 265.20
NaN 164.0 687 165 245.0 20.0 NaN
180 421.0 512 184 953.0 73.0 450.00
110 654.0 913 173 103.0 84.0 390.60
130 NaN 754 124 207.0 26.0 NaN
170 256.0 843 97 806.0 87.0 434.40
140 754.0 95 101 541.0 64.0 326.20
80 985.0 184 84 90.0 11.0 284.60
96 65.0 127 130 421.0 34.0 167.80
Tried code (this only gives me the mean and ignores the NaN condition). Is there some brief lambda function that can add the condition to the code?
import pandas as pd
import numpy as np
data = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"2014": [100,120,np.nan,180,110,130,170,140,80,96],
"2015": [122,159,164,421,654,np.nan,256,754,985,65],
"2016": [324,54,687,512,913,754,843,95,184,127],
"2017": [632,452,165,184,173,124,97,101,84,130],
"2018": [np.nan,541,245,953,103,207,806,541,90,421]})
print(data)
# If any of these columns is empty (NaN) in a row, the new 'mean' column should be NaN for that row.
# I only want the mean when all 5 cells in the row have a numeric value.
fiveyear = ["2014", "2015", "2016", "2017", "2018"]
data.loc[:, 'mean'] = data[fiveyear].mean(axis=1)
print(data)
Upvotes: 2
Views: 5805
Reputation: 716
This is the same as @ALollz's answer, but with a flexible way to detect the year columns regardless of how many years there are in the df:
# get the year columns in a list (every column except 'ID')
yearsCols = [c for c in df if c != 'ID']
# calculate the mean over rows that have no missing year
df['mean'] = df[yearsCols].dropna(how='any').mean(axis=1)
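One caveat with selecting "everything except ID": if the snippet is run a second time, the new 'mean' column is picked up as a year too. A minimal sketch of a stricter selection follows, assuming the year columns are strings of exactly four digits (that naming rule is an assumption based on the question's data, and the small frame here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the question's full data).
df = pd.DataFrame({
    "ID": [12, 96, 20],
    "2014": [100, 120, np.nan],
    "2015": [122, 159, 164],
    "2016": [324, 54, 687],
})

def year_columns(frame):
    """Return columns whose name is a four-digit string (assumed year naming)."""
    return [c for c in frame.columns
            if isinstance(c, str) and len(c) == 4 and c.isdigit()]

years_cols = year_columns(df)
df["mean"] = df[years_cols].dropna(how="any").mean(axis=1)

# Selecting again after 'mean' exists still returns only the year columns.
years_again = year_columns(df)
print(years_cols, years_again)
print(df)
```

If the year columns are integers rather than strings, the check would need to be adapted.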
Upvotes: 2
Reputation: 59549
Use dropna to remove rows before calculating the mean. Because pandas aligns on the index when assigning the result back, and these rows were removed, the result for the dropped rows is NaN.
df['mean'] = df[fiveyear].dropna(how='any').mean(1)
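To make the index alignment visible, here is a small sketch on a made-up three-row frame (the data is illustrative, not the question's):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "2014": [100, 120, np.nan],
    "2015": [122, 159, 164],
})

# Row 2 contains a NaN, so dropna removes it; the result only has index 0 and 1.
partial = df.dropna(how="any").mean(axis=1)
print(partial.index.tolist())

# Assignment aligns on the index: row 2 has no matching label, so it becomes NaN.
df["mean"] = partial
print(df)
```

The intermediate Series is shorter than the frame, which is why this only works when the result is assigned back to the same frame (or reindexed).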
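It is also possible to mask the result so that only rows that were entirely non-null keep their mean (note that recent pandas versions require axis to be passed by keyword to any):
df['mean'] = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))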
A bit more of a hack, but because you know you need all 5 values, you could also use sum, which supports the min_count argument, so any row with fewer than 5 values is NaN:
df['mean'] = df[fiveyear].sum(1, min_count=len(fiveyear))/len(fiveyear)
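As a quick check, the three approaches can be run side by side on the data from the question. This is a sketch only; the variable names via_dropna, via_mask and via_sum are introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
    "2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
    "2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
    "2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
    "2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
    "2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421],
})
fiveyear = ["2014", "2015", "2016", "2017", "2018"]
years = df[fiveyear]

# 1) Drop incomplete rows, then restore the full index so dropped rows are NaN.
via_dropna = years.dropna(how="any").mean(axis=1).reindex(df.index)

# 2) Compute every mean, then blank out rows that had any missing value.
via_mask = years.mean(axis=1).mask(years.isnull().any(axis=1))

# 3) Sum with min_count so incomplete rows sum to NaN, then divide by the count.
via_sum = years.sum(axis=1, min_count=len(fiveyear)) / len(fiveyear)

same = (np.allclose(via_dropna, via_mask, equal_nan=True)
        and np.allclose(via_mask, via_sum, equal_nan=True))
print(same)

df["mean"] = via_mask
print(df)
```

The mask version is probably the easiest to read; the min_count version avoids building an intermediate frame.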
Upvotes: 4