Matthi9000

Reputation: 1227

Only calculate mean of data rows in dataframe with no NaN-values

I have a dataframe with client IDs and their expenses for 2014-2018. What I want is the mean of the 2014-2018 expenses for each ID in the dataframe. There is however one condition: if any of the cells in the columns 2014-2018 is empty (NaN), NaN should be returned. So I only want the mean to be calculated when all 5 cells in the columns 2014-2018 have a numeric value.

Initial dataframe:

2014   2015  2016  2017   2018   ID
100  122.0   324   632    NaN  12.0
120  159.0    54   452  541.0  96.0
NaN  164.0   687   165  245.0  20.0
180  421.0   512   184  953.0  73.0
110  654.0   913   173  103.0  84.0
130    NaN   754   124  207.0  26.0
170  256.0   843    97  806.0  87.0
140  754.0    95   101  541.0  64.0
 80  985.0   184    84   90.0  11.0
 96   65.0   127   130  421.0  34.0

Desired output

2014   2015  2016  2017   2018    ID    mean
 100  122.0   324   632    NaN  12.0     NaN
 120  159.0    54   452  541.0  96.0  265.20
 NaN  164.0   687   165  245.0  20.0     NaN
 180  421.0   512   184  953.0  73.0  450.00
 110  654.0   913   173  103.0  84.0  390.60
 130    NaN   754   124  207.0  26.0     NaN
 170  256.0   843    97  806.0  87.0  434.40
 140  754.0    95   101  541.0  64.0  326.20
  80  985.0   184    84   90.0  11.0  284.60
  96   65.0   127   130  421.0  34.0  167.80

Tried code: -> this however only gives me the mean, ignoring the NaN condition. Is there some brief lambda function that can add the condition to the code?

import pandas as pd
import numpy as np

data = pd.DataFrame({"ID":   [12, 96, 20, 73, 84, 26, 87, 64, 11, 34],
                     "2014": [100, 120, np.nan, 180, 110, 130, 170, 140, 80, 96],
                     "2015": [122, 159, 164, 421, 654, np.nan, 256, 754, 985, 65],
                     "2016": [324, 54, 687, 512, 913, 754, 843, 95, 184, 127],
                     "2017": [632, 452, 165, 184, 173, 124, 97, 101, 84, 130],
                     "2018": [np.nan, 541, 245, 953, 103, 207, 806, 541, 90, 421]})

print(data)


# If any cell in these columns is empty (NaN), the new 'mean' column should
# be NaN; I only want the mean when all 5 cells have a numeric value.
fiveyear = ["2014", "2015", "2016", "2017", "2018"]



data.loc[:, 'mean'] = data[fiveyear].mean(axis=1)


print(data)

Upvotes: 2

Views: 5805

Answers (2)

Mit

Reputation: 716

This is the same as @ALollz's answer, but with a flexible way to detect all year columns regardless of how many years are in the df:

# get the year columns in a list (every column except ID)
yearsCols = [c for c in df if c != 'ID']

# calculate the mean
df['mean'] = df[yearsCols].dropna(how='any').mean(axis=1)
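As a quick sanity check, here is a sketch reusing the first two rows of the question's data (under the name `df`, as the answer assumes):

```python
import pandas as pd
import numpy as np

# First two rows of the question's data, under the name df used in the answer
df = pd.DataFrame({"ID":   [12, 96],
                   "2014": [100, 120],
                   "2015": [122, 159],
                   "2016": [324, 54],
                   "2017": [632, 452],
                   "2018": [np.nan, 541]})

# Every column except ID is treated as a year column
yearsCols = [c for c in df if c != 'ID']
print(yearsCols)  # ['2014', '2015', '2016', '2017', '2018']

df['mean'] = df[yearsCols].dropna(how='any').mean(axis=1)
print(df['mean'].tolist())  # [nan, 265.2]
```

The first row has a NaN in 2018 and gets NaN; the second row gets (120+159+54+452+541)/5 = 265.2, matching the desired output.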

Upvotes: 2

ALollz

Reputation: 59549

Use dropna to remove rows before calculating the mean. Because pandas aligns on the index when assigning the result back, and these rows were removed, the dropped rows end up as NaN:

df['mean'] = df[fiveyear].dropna(how='any').mean(axis=1)
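A minimal sketch of why the alignment works, using a hypothetical two-row frame:

```python
import pandas as pd
import numpy as np

# Hypothetical two-row frame: the second row is incomplete
df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# dropna removes row 1, so the mean Series only has index [0]
partial = df[["a", "b"]].dropna(how="any").mean(axis=1)

# Assignment aligns on the index: row 1 has no match and gets NaN
df["mean"] = partial
print(df["mean"].tolist())  # [2.0, nan]
```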

It's also possible to mask the result to only those rows that were all non-null:

df['mean'] = df[fiveyear].mean(axis=1).mask(df[fiveyear].isnull().any(axis=1))
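The same hypothetical two-row frame illustrates the masking variant:

```python
import pandas as pd
import numpy as np

# Hypothetical two-row frame: the second row is incomplete
df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
cols = ["a", "b"]

# mean skips NaN by default; mask then overwrites any row that had a NaN
df["mean"] = df[cols].mean(axis=1).mask(df[cols].isnull().any(axis=1))
print(df["mean"].tolist())  # [2.0, nan]
```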

A bit more of a hack, but since you know you need all 5 values, you could also use sum, which supports the min_count argument, so anything with fewer than 5 values becomes NaN:

df['mean'] = df[fiveyear].sum(axis=1, min_count=len(fiveyear)) / len(fiveyear)
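A sketch of the min_count trick on a hypothetical two-row frame:

```python
import pandas as pd
import numpy as np

# Hypothetical two-row frame: the second row is missing a value
df = pd.DataFrame({"x": [1.0, 1.0], "y": [2.0, np.nan]})
cols = ["x", "y"]

# sum with min_count=len(cols) yields NaN for any row with fewer values,
# and dividing by the column count turns the surviving sums into means
df["mean"] = df[cols].sum(axis=1, min_count=len(cols)) / len(cols)
print(df["mean"].tolist())  # [1.5, nan]
```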

Upvotes: 4
