Reputation: 67
So I've looked at multiple potential solutions but none seem to work.
Basically, I want to create a new column in my data frame which is the mean of multiple other columns. I want this mean to exclude NaN values but still calculate the mean even if there are NaN values in the row.
I have a data frame which looks something like this (but actually Q222-229):
ID Q1 Q2 Q3 Q4 Q5
1 4 NaN NaN NaN NaN
2 5 7 8 NaN NaN
3 7 1 2 NaN NaN
4 2 2 3 4 1
5 1 3 NaN NaN NaN
And I want to create a column which is the mean of Q1, Q2, Q3, Q4, Q5 ie:
ID Q1 Q2 Q3 Q4 Q5 avg_age
1 4 NaN NaN NaN NaN 4
2 5 7 8 NaN NaN 5.5
3 7 1 2 NaN NaN 3.5
4 2 2 3 4 1 2
5 1 3 NaN NaN NaN 2
(ignore values)
However, every method I have tried returns NaN values in the avg_age column which is making me think that when ignoring the NaN values, pandas is ignoring the whole row. But I dont want this to happen, instead I want the mean returned with the NaN values ignored.
Here is what I have tried so far:
1.
avg_age = s.loc[: , "Q222":"Q229"]
avg_age = avg_age.mean(axis=1)
s = pd.concat([s, avg_age], axis=1)
2.
s['avg_age'] = s[['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']].mean(axis=1)
3.
avg_age = ['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']
s.loc[:, 'avg_age'] = s[avg_age].mean(axis=1)
I am not sure if there is something wrong with the way I have coded the values initially so here is my code for reference:
#Changing age variable inputs
s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q223'] = s['Q223'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q224'] = s['Q224'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q225'] = s['Q225'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q226'] = s['Q226'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q227'] = s['Q227'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q228'] = s['Q228'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q229'] = s['Q229'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
['2','3','4','5', '6', '7', '8', np.NaN])
s['Q222'] = s['Q222'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q223'] = s['Q223'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q224'] = s['Q224'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q225'] = s['Q225'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q226'] = s['Q226'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q227'] = s['Q227'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q228'] = s['Q228'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q229'] = s['Q229'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
Thanks in advance to anyone who is able to help!
Upvotes: 0
Views: 5882
Reputation: 13821
skipna=True
Can get it with a list comprehension
to get the columns to average, and mean()
with:
df['ave_age'] = df[[col for col in df.columns if 'Q' in col]].mean(axis = 1,skipna = True)
Upvotes: 0
Reputation: 1367
The default behavior of DataFrame.mean()
should do what you want.
Here's an example showing taking a mean over a subset of the columns and placing it in a newly created column:
In[19]: tmp
Out[19]:
a b c
0 1 2 5.0
1 2 3 6.0
2 3 4 NaN
In[24]: tmp['mean'] = tmp[['b', 'c']].mean(axis=1)
In[25]: tmp
Out[25]:
a b c mean
0 1 2 5.0 3.5
1 2 3 6.0 4.5
2 3 4 NaN 4.0
As for what's going wrong in your code:
s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"], ['2','3','4','5', '6', '7', '8', np.NaN])
You don't have numerical values (i.e 2, 3, 4) in your data frame, you have strings ('2', '3', and '4'). The DataFrame.mean()
function is treating these strings as NaN, so you are getting NaN as the result for all your mean calculations.
Try filling your frame with numbers, like so:
s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
[2, 3, 4, 5, 6, 7, 8, np.NaN])
Upvotes: 2