Create a column which is the mean of multiple columns in a data frame in pandas

Question

So I've looked at multiple potential solutions but none seem to work.

Basically, I want to create a new column in my data frame which is the mean of multiple other columns. I want this mean to exclude NaN values but still calculate the mean even if there are NaN values in the row.

I have a data frame which looks something like this (but actually Q222-229):

ID   Q1   Q2   Q3   Q4   Q5
1    4    NaN  NaN  NaN  NaN
2    5    7    8    NaN  NaN
3    7    1    2    NaN  NaN
4    2    2    3    4    1
5    1    3    NaN  NaN  NaN

And I want to create a column which is the mean of Q1, Q2, Q3, Q4, Q5 ie:

ID   Q1   Q2   Q3   Q4   Q5   avg_age
1    4    NaN  NaN  NaN  NaN  4
2    5    7    8    NaN  NaN  5.5
3    7    1    2    NaN  NaN  3.5
4    2    2    3    4    1    2
5    1    3    NaN  NaN  NaN  2

(ignore values)

However, every method I have tried returns NaN values in the avg_age column which is making me think that when ignoring the NaN values, pandas is ignoring the whole row. But I dont want this to happen, instead I want the mean returned with the NaN values ignored.

Here is what I have tried so far:

1.
    avg_age = s.loc[: , "Q222":"Q229"]
    avg_age = avg_age.mean(axis=1)
    s = pd.concat([s, avg_age], axis=1)

2.
    s['avg_age'] = s[['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']].mean(axis=1)

3.

    avg_age = ['Q222', 'Q223', 'Q224', 'Q225', 'Q226', 'Q227', 'Q228', 'Q229']
    s.loc[:, 'avg_age'] = s[avg_age].mean(axis=1)

I am not sure if there is something wrong with the way I have coded the values initially so here is my code for reference:

#Changing age variable inputs

s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q223'] = s['Q223'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q224'] = s['Q224'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q225'] = s['Q225'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q226'] = s['Q226'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q227'] = s['Q227'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q228'] = s['Q228'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])
s['Q229'] = s['Q229'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                              ['2','3','4','5', '6', '7', '8', np.NaN])

s['Q222'] = s['Q222'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q223'] = s['Q223'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q224'] = s['Q224'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q225'] = s['Q225'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q226'] = s['Q226'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q227'] = s['Q227'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q228'] = s['Q228'].replace(['0-4', '05-11', '12-15', '16-17'], '1')
s['Q229'] = s['Q229'].replace(['0-4', '05-11', '12-15', '16-17'], '1')

Thanks in advance to anyone who is able to help!

The Photon · Accepted Answer

The default behavior of DataFrame.mean() should do what you want.

Here's an example showing taking a mean over a subset of the columns and placing it in a newly created column:

In[19]: tmp
Out[19]: 
   a  b    c
0  1  2  5.0
1  2  3  6.0
2  3  4  NaN

In[24]: tmp['mean'] = tmp[['b', 'c']].mean(axis=1)

In[25]: tmp
Out[25]: 
   a  b    c  mean
0  1  2  5.0   3.5
1  2  3  6.0   4.5
2  3  4  NaN   4.0

As for what's going wrong in your code:

s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                         ['2','3','4','5', '6', '7', '8', np.NaN])

You don't have numerical values (i.e 2, 3, 4) in your data frame, you have strings ('2', '3', and '4'). The DataFrame.mean() function is treating these strings as NaN, so you are getting NaN as the result for all your mean calculations.

Try filling your frame with numbers, like so:

 s['Q222'] = s['Q222'].replace(['18-24', '25-34','35-44', '45-54','55-64', '65-74', '75 or older', "Don't know"],
                          [2, 3, 4, 5, 6, 7, 8, np.NaN])

Create a column which is the mean of multiple columns in a data frame in pandas

Answers (2)

Related Questions