Jim

Reputation: 415

How to Pandas fillna() with mode of column?

I have a data set with a column called 'Native Country' that contains around 30,000 records. Some values are missing, represented by NaN, so I decided to fill them with the mode() value. I wrote something like this:

data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)

However when I do a count of missing values:

for col_name in data.columns: 
    print ("column:",col_name,".Missing:",sum(data[col_name].isnull()))

It is still coming up with the same number of NaN values for the column Native Country.

Upvotes: 38

Views: 117860

Answers (8)

paulduf

Reputation: 311

So, I note that df.mean() returns a pd.Series, whereas df.mode() called on a dataset with mixed types (both numeric and categorical in my case) returns a pd.DataFrame with the same columns as df, with row 0 giving the mode. This is expected because a Series must have a single dtype, but it still causes df.fillna(df.mode()) to fail where df.fillna(df.mean()) works.

Here is a one-liner to circumvent the issue in this case:

df.fillna({k: v[0] for k, v in df.mode().to_dict().items()})

Another issue is that v[0] picks only the first of possibly several modes, as pointed out by this answer. This can be improved by applying another aggregation function to v.
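
To illustrate the "another aggregation function" idea, here is a minimal sketch. The sample data and the choice of max() as the tie-breaker are my own assumptions, not a recommendation; any rule that picks one value from the tied modes would slot in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "city" has two tied modes, "age" has a single mode.
df = pd.DataFrame({
    "city": ["Rome", "Oslo", "Rome", "Oslo", np.nan],
    "age": [30.0, 30.0, 41.0, np.nan, 25.0],
})

# df.mode() returns a DataFrame; columns with fewer modes are padded with NaN.
modes = df.mode()

# Tie-breaker (an arbitrary choice for this sketch): the largest non-missing mode.
fill_values = {col: max(modes[col].dropna().tolist()) for col in modes.columns}

filled = df.fillna(fill_values)
print(filled)
```

Note that max() raises on an empty list, so a column that is entirely NaN would need to be skipped first.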

Upvotes: 0

Abdelrahman Abozied

Reputation: 23

You can fill with the mode or with any other strategy:

  1. for mode:
    num = data['Native Country'].mode()[0]
    data['Native Country'].fillna(num, inplace=True)
  2. for mean or median (numeric columns only):
    num = data['Native Country'].mean()  # or median(); no [0] needed because it returns a scalar
    data['Native Country'].fillna(num, inplace=True)

or in one line, like this:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

Upvotes: 1

Vojtech Stas

Reputation: 751

For those who came here (as I did) to fill NAs in multiple columns, grouped by multiple columns, and ran into the problem that mode returns nothing when a group contains only NA values:

df[['col_to_fill_NA_1','col_to_fill_NA_2']] = df.groupby(['col_to_group_by_1', 'col_to_group_by_2'], dropna=False)[['col_to_fill_NA_1','col_to_fill_NA_2']].transform(lambda x: x.fillna(x.mode()[0]) if len(x.mode()) == 1 else x)

You can fill any number of "col_to_fill_NA" columns and group by any number of "col_to_group_by" columns. The if statement fills with the mode when a group has exactly one mode and leaves the group unchanged otherwise, so groups that contain only NAs stay NA.
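
A small worked example of the same pattern with one grouping column and one column to fill. The column names, the data, and the helper name fill_with_single_mode are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: region N has one mode, S is all NaN, W has two tied modes.
df = pd.DataFrame({
    "region": ["N", "N", "N", "S", "S", "W", "W", "W"],
    "color": ["red", "red", np.nan, np.nan, np.nan, "blue", "green", np.nan],
})

def fill_with_single_mode(group):
    """Fill NaN with the group's mode only when exactly one mode exists."""
    modes = group.mode()
    return group.fillna(modes.iloc[0]) if len(modes) == 1 else group

df["color"] = (
    df.groupby("region", dropna=False)["color"]
      .transform(fill_with_single_mode)
)
print(df)
```

Only region N gets filled here. Region S has no mode at all, and region W has a tie, so both keep their NaN values.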

Upvotes: 0

user3067175

Reputation: 170

import numpy as np

import pandas as pd

print(pd.__version__)

1.2.0

df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})

  Country Purchased
0     NaN       NaN
1  France       Yes
2     NaN       Yes
3   Spain        No
4  France       NaN

df.fillna(df.mode())  ## only applied on first row because df.mode() returns a dataframe with one row

  Country Purchased
0  France       Yes
1  France       Yes
2     NaN       Yes
3   Spain        No
4  France       NaN

df = pd.DataFrame({'Country': [np.nan, 'France', np.nan, 'Spain', 'France'], 'Purchased': [np.nan,'Yes', 'Yes', 'No', np.nan]})

df.fillna(df.mode().iloc[0]) ## convert df to a series

  Country Purchased
0  France       Yes
1  France       Yes
2  France       Yes
3   Spain        No
4  France       Yes

Upvotes: 3

Eduardo Passeto

Reputation: 17

Try something like: fill_mode = lambda col: col.fillna(col.mode()[0]), and then apply it to every column: new_df = df.apply(fill_mode, axis=0)

Upvotes: -1

Audris Ločmelis

Reputation: 19

If we fill in the missing values with fillna(df['colX'].mode()), since the result of mode() is a Series, it will only fill in the first couple of rows for the matching indices. At least if done as below:

fill_mode = lambda col: col.fillna(col.mode())
df.apply(fill_mode, axis=0)

However, by simply taking the first value of the Series fillna(df['colX'].mode()[0]), I think we risk introducing unintended bias in the data. If the sample is multimodal, taking just the first mode value makes the already biased imputation method worse. For example, taking only 0 if we have [0, 21, 99] as the equally most frequent values. Or filling missing values with False when True and False values are equally frequent in a given column.

I don't have a clear-cut solution here. Assigning a random value from all the local maxima could be one approach if using the mode is a necessity.
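
A rough sketch of the random-assignment idea, assuming NumPy's default_rng is acceptable as the source of randomness. The function name fill_with_random_mode and the seed parameter are hypothetical, and this draws uniformly from the tied modes, which is only one possible policy.

```python
import numpy as np
import pandas as pd

def fill_with_random_mode(col, seed=None):
    """Fill each NaN with a value drawn uniformly from the column's modes."""
    modes = col.mode()
    if modes.empty:          # column is entirely NaN: nothing to draw from
        return col
    rng = np.random.default_rng(seed)
    missing = col.isna()
    picks = rng.choice(modes.to_numpy(), size=int(missing.sum()))
    out = col.copy()
    out[missing] = picks
    return out

# 0, 21 and 99 are equally frequent, as in the example above.
s = pd.Series([0, 0, 21, 21, 99, 99, np.nan, np.nan, np.nan, 5])
filled = fill_with_random_mode(s, seed=0)
print(filled)
```

Passing a seed makes the imputation reproducible; without it, each run may fill the gaps differently.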

Upvotes: 1

simone

Reputation: 107

Be careful, NaN may be the mode of your dataframe: in this case, you are replacing NaN with another NaN.
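
A quick way to check this on your own column, as far as I understand the current behaviour: Series.mode() ignores NaN by default (dropna=True), so NaN shows up as the mode only with dropna=False, and a column that is entirely NaN yields an empty result. The sample values below are made up.

```python
import numpy as np
import pandas as pd

# NaN is the most frequent entry in this hypothetical column.
s = pd.Series(["US", "US", np.nan, np.nan, np.nan, "UK"])

default_modes = s.mode()               # NaN ignored by default
modes_with_nan = s.mode(dropna=False)  # NaN counted like any other value

# A column with no valid values has no mode at all.
all_missing = pd.Series([np.nan, np.nan], dtype="object")
empty_modes = all_missing.mode()

print(default_modes.tolist())
print(modes_with_nan.tolist())
print(len(empty_modes))
```

In the all-NaN case, indexing the result with [0] fails, so it is worth checking that the result is not empty before filling.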

Upvotes: 9

zipa

Reputation: 27879

Just take the first element of the Series:

data['Native Country'].fillna(data['Native Country'].mode()[0], inplace=True)

or you can do the same with assignment:

data['Native Country'] = data['Native Country'].fillna(data['Native Country'].mode()[0])
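
A minimal sketch of why the [0] matters, using a made-up five-row column: mode() returns a Series with its own index (starting at 0), and fillna() aligns a Series argument by index label, so only rows whose labels match get filled.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data set.
data = pd.DataFrame({"Native Country": [np.nan, "India", np.nan, "Cuba", "India"]})

mode_series = data["Native Country"].mode()   # Series with the single label 0

# Passing the Series: only the NaN at label 0 can be matched and filled.
aligned = data["Native Country"].fillna(mode_series)

# Passing the scalar: every NaN is filled.
scalar = data["Native Country"].fillna(mode_series.iloc[0])

print(aligned.isna().sum(), scalar.isna().sum())
```

If label 0 of your column is not missing, the Series version fills nothing at all, which matches the unchanged NaN count in the question.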

Upvotes: 71
