Reputation: 605

Function to replace NaN values in a dataframe with mean of the related column

EDIT: This question is not a duplicate of "pandas dataframe replace nan values with average of columns", because I want to replace the NaN values in each column with the mean of that column, not with the mean of all the values in the dataframe.

QUESTION

I have a pandas dataframe (train) with a hundred columns, to which I need to apply machine learning techniques.

I usually do feature engineering by hand, but in this case there are too many columns to handle that way.

I would like to build a Python function that:

1) Finds the NaN values in each column (I was thinking of using df.isnull().any())

2) Replaces each NaN value with the mean of the column in which it was found.

My idea was something like this:

def replace(value):
    for value in train:
        if train['value'].isnull():
           train['value'] = train['value'].fillna(train['value'].mean())

train = train.apply(replace,axis=1)

But I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3063             try:
-> 3064                 return self._engine.get_loc(key)
   3065             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'value'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-25-003b3eb2463c> in <module>()
----> 1 train = train.apply(replace,axis=1)

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6012                          args=args,
   6013                          kwds=kwds)
-> 6014         return op.get_result()
   6015 
   6016     def applymap(self, func):

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    140             return self.apply_raw()
    141 
--> 142         return self.apply_standard()
    143 
    144     def apply_empty_result(self):

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    246 
    247         # compute the result using the series generator
--> 248         self.apply_series_generator()
    249 
    250         # wrap results

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    275             try:
    276                 for i, v in enumerate(series_gen):
--> 277                     results[i] = self.f(v)
    278                     keys.append(v.name)
    279             except Exception as e:

<ipython-input-22-2e7fa654e765> in replace(value)
      1 def replace(value):
      2     for value in train:
----> 3         if train['value'].isnull():
      4            train['value'] = train['value'].fillna(df['value'].mean())

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2686             return self._getitem_multilevel(key)
   2687         else:
-> 2688             return self._getitem_column(key)
   2689 
   2690     def _getitem_column(self, key):

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2693         # get column
   2694         if self.columns.is_unique:
-> 2695             return self._get_item_cache(key)
   2696 
   2697         # duplicate columns & possible reduce dimensionality

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   2484         res = cache.get(item)
   2485         if res is None:
-> 2486             values = self._data.get(item)
   2487             res = self._box_item_values(item, values)
   2488             cache[item] = res

/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3064                 return self._engine.get_loc(key)
   3065             except KeyError:
-> 3066                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3067 
   3068         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('value', 'occurred at index 0')

While searching for solutions, I found:

This but it works with a txt file (not a pandas dataframe)
This question about the df.isnull().any() method.

Upvotes: 3

Answers (3)

Quickbeam2k1

Reputation: 5437

You can also use fillna

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})
df.fillna(df.mean(axis=0))
    A   B
0   1.0 2.0
1   2.0 2.0
2   1.5 2.0

df.mean(axis=0) computes the mean for every column, and this is passed to the fillna method.

On my machine, this solution is about twice as fast as the solution using apply for the data set shown above.
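To make the mechanism explicit, here is a minimal sketch that splits the one-liner into two steps. The variable names col_means and filled are only illustrative. It relies on fillna matching the index of the Series it receives against the column labels of the dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [2, np.nan, np.nan]})

# One mean per column, indexed by column name (NaN values are skipped by default).
col_means = df.mean(axis=0)

# fillna matches the Series index against the column labels,
# so each column is filled with its own mean.
filled = df.fillna(col_means)
```

Note that fillna returns a new dataframe here; df itself is left unchanged unless the result is assigned back.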

Upvotes: 7

Paul-Darius

Reputation: 126

You can try something like:

for col in df.columns:
    df[col] = df[col].fillna(df[col].mean())

But that is just one way to do it. Your code is almost correct. The error is that you should write

train[value]

instead of

train['value']

everywhere in your code, because the latter looks for a column literally named "value", whereas value is the loop variable that holds each column name in turn.
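As a hedged sketch of how the function from the question might look once it runs, the version below makes two further changes besides the indexing fix. It tests the column with .isnull().any(), since using a whole Series as an if condition raises a ValueError, and it is called directly rather than through train.apply(..., axis=1), which would pass each row to the function. The name fill_with_column_means and the sample data are hypothetical, and skipping non-numeric columns is an assumption about what is wanted.

```python
import numpy as np
import pandas as pd


def fill_with_column_means(frame):
    """Return a copy of frame with NaNs in numeric columns replaced by the column mean."""
    result = frame.copy()
    for col in result.columns:
        # Only numeric columns have a meaningful mean; other columns are left untouched.
        if pd.api.types.is_numeric_dtype(result[col]) and result[col].isnull().any():
            result[col] = result[col].fillna(result[col].mean())
    return result


# Hypothetical sample data standing in for the real train dataframe.
train = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 4.0, 6.0],
    "label": ["a", "b", None],
})
train = fill_with_column_means(train)
```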

Upvotes: 4

zipa

Reputation: 27899

To fill the NaN values of each column with that column's mean, use:

df.apply(lambda x: x.fillna(x.mean()))
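With a hundred columns, some of them may not be numeric, and taking the mean of such a column can raise an error or give a meaningless result. A minimal sketch, assuming only numeric columns should be filled, restricts the same apply call with select_dtypes. The "name" column and the variable numeric_cols are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan],
    "B": [2, np.nan, np.nan],
    "name": ["u", None, "w"],  # hypothetical non-numeric column
})

# Pick out the labels of the numeric columns only.
numeric_cols = df.select_dtypes(include="number").columns

# apply works column by column by default (axis=0), so x is one column at a time.
df[numeric_cols] = df[numeric_cols].apply(lambda x: x.fillna(x.mean()))
```

The missing entry in "name" is left as it is; it would need a different strategy, such as filling with the most frequent value.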

Upvotes: 6
