Harshil Modi

Reputation: 436

Replacing missing values with mean

I am exploring the pandas library and found this dataset. My task is to fill each ? with the mean of its group, grouping by the column 'num-of-doors'. When I used dataframe.groupby('num-of-doors').mean(), pandas was unable to compute the mean of these columns:

'peak-rpm', 'price', 'bore', 'stroke', 'normalized-losses', 'horsepower'

So I tried a small dataset of my own to find out why it was not working. I created a file with the following contents:

c0,c1,type
1,2,0
2,3,0
2,4,0
1,?,1
1,3,1

and I wrote the following script:

import numpy as np
import pandas as pd

data = pd.read_csv("data.csv")
data = data.replace('?',np.nan)
print(data)
print(data.groupby('type').mean())

This is what I'm getting as output:

   c0   c1  type
0   1    2     0
1   2    3     0
2   2    4     0
3   1  NaN     1
4   1    3     1
            c0
type          
0     1.666667
1     1.000000

Can you please explain what is going on here? Why am I not getting the mean for column c1? I tried some Stack Overflow answers, but still got nothing. Any suggestions?

Really appreciate your help.

Upvotes: 2

Views: 370

Answers (2)

Dani Mesejo

Reputation: 61910

The problem is that c1 is not of a numeric type. Do:

data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
print(data.groupby('type').mean())

Output

            c0   c1
type               
0     1.666667  3.0
1     1.000000  3.0

When you read the original data into a DataFrame, the column is of dtype object because it contains a ? (use dtypes to verify):

c0       int64
c1      object
type     int64
dtype: object
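
An alternative to replace + astype(float) is pd.to_numeric with errors="coerce", which turns anything unparseable into NaN in one step. The following is only a sketch: it inlines the question's CSV through io.StringIO (an assumption made so the snippet runs without a file on disk), and the names csv_text and group_means are illustrative.

```python
import io

import pandas as pd

# Same contents as the question's data.csv, inlined for a self-contained sketch.
csv_text = "c0,c1,type\n1,2,0\n2,3,0\n2,4,0\n1,?,1\n1,3,1\n"

data = pd.read_csv(io.StringIO(csv_text))
print(data.dtypes)  # c1 is not numeric here because of the '?' entry

# errors="coerce" converts values that cannot be parsed as numbers into NaN.
data["c1"] = pd.to_numeric(data["c1"], errors="coerce")

# With c1 numeric, the group means include it.
group_means = data.groupby("type").mean()
print(group_means)
```

Note that coercion silently turns every non-numeric entry into NaN, not only ?, so it is worth checking how many values were affected on a real dataset.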

If you want to replace the NaN values with the mean of the group, use transform + fillna:

data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)

res = data.groupby('type').transform('mean')
print(data.fillna(res))

Output

   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  3.0     1
4   1  3.0     1

As a final piece of advice, you could read the CSV as:

data = pd.read_csv("data.csv", na_values='?')
print(data)

Output

   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  NaN     1
4   1  3.0     1

This saves you from having to convert the columns to numeric afterwards.
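
Putting the two ideas together, reading with na_values and then filling per group might look like the sketch below. It again inlines the sample CSV through io.StringIO as an assumption for self-containment, and group_mean is an illustrative name.

```python
import io

import pandas as pd

# Same contents as the question's data.csv, inlined for a self-contained sketch.
csv_text = "c0,c1,type\n1,2,0\n2,3,0\n2,4,0\n1,?,1\n1,3,1\n"

# na_values="?" makes read_csv parse '?' as NaN, so c1 is numeric from the start.
data = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Per-group mean of c1, broadcast back to the original row positions.
group_mean = data.groupby("type")["c1"].transform("mean")

# Fill only the missing entries with the mean of their own group.
data["c1"] = data["c1"].fillna(group_mean)
print(data)
```

For the dataset in the question, the same pattern would presumably apply with groupby('num-of-doors') and each of the affected columns. Rows whose grouping key is itself missing would probably stay unfilled, since groupby drops NaN keys by default.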

Upvotes: 3

wwnde

Reputation: 26676

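# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df['c1'] = df['c1'].str.replace('[?]', 'NaN', regex=True).astype(float)
# fill the missing values in each group with that group's mean
df.groupby('type').apply(lambda x: x.fillna(x.mean()))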

Upvotes: 2
