Reputation: 605
EDIT: This question is not a duplicate of "pandas dataframe replace nan values with average of columns", because I want to replace the NaN values of each column with the mean of that column, not with the mean of all the dataframe values.
QUESTION
I have a pandas dataframe (train) with a hundred columns to which I have to apply machine learning techniques.
I usually do feature engineering by hand, but in this case I have a lot of columns to deal with.
I would like to build a Python function that:
1) Finds the NaN values in each column (I thought of using df.isnull().any())
2) Replaces each NaN value with the mean of the column in which it was found.
My idea was something like this:
def replace(value):
    for value in train:
        if train['value'].isnull():
            train['value'] = train['value'].fillna(train['value'].mean())
train = train.apply(replace,axis=1)
But I receive the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3063 try:
-> 3064 return self._engine.get_loc(key)
3065 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'value'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-003b3eb2463c> in <module>()
----> 1 train = train.apply(replace,axis=1)
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6012 args=args,
6013 kwds=kwds)
-> 6014 return op.get_result()
6015
6016 def applymap(self, func):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
<ipython-input-22-2e7fa654e765> in replace(value)
1 def replace(value):
2 for value in train:
----> 3 if train['value'].isnull():
4 train['value'] = train['value'].fillna(df['value'].mean())
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
2686 return self._getitem_multilevel(key)
2687 else:
-> 2688 return self._getitem_column(key)
2689
2690 def _getitem_column(self, key):
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2693 # get column
2694 if self.columns.is_unique:
-> 2695 return self._get_item_cache(key)
2696
2697 # duplicate columns & possible reduce dimensionality
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2484 res = cache.get(item)
2485 if res is None:
-> 2486 values = self._data.get(item)
2487 res = self._box_item_values(item, values)
2488 cache[item] = res
/opt/conda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
4113
4114 if not isna(item):
-> 4115 loc = self.items.get_loc(item)
4116 else:
4117 indexer = np.arange(len(self.items))[isna(self.items)]
/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3064 return self._engine.get_loc(key)
3065 except KeyError:
-> 3066 return self._engine.get_loc(self._maybe_cast_indexer(key))
3067
3068 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('value', 'occurred at index 0')
While searching for solutions, I found:
This, but it works with a txt file (not a pandas dataframe)
This question about the df.isnull().any() method.
Upvotes: 3
Views: 14085
Reputation: 5437
You can also use fillna
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [2, np.nan, np.nan]})
df.fillna(df.mean(axis=0))

     A    B
0  1.0  2.0
1  2.0  2.0
2  1.5  2.0
df.mean(axis=0) computes the mean of every column, and the result is passed to the fillna method.
On my machine, this solution is twice as fast as the solution using apply for the data set shown above.
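With a hundred columns, some of them may well be non-numeric, and in recent pandas versions a plain df.mean() can raise an error on such columns. A minimal sketch, assuming a hypothetical frame with one text column (column "C" is made up for illustration): numeric_only=True restricts the mean to the numeric columns, and fillna only fills columns that have an entry in the passed Series.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: two numeric columns and one text column.
df = pd.DataFrame({
    "A": [1, 2, np.nan],
    "B": [2, np.nan, np.nan],
    "C": ["x", None, "z"],
})

# Means of the numeric columns only (a Series indexed by column name).
means = df.mean(numeric_only=True)

# fillna with a Series fills each column with the value under the same label;
# "C" has no entry in `means`, so its missing value is left untouched.
filled = df.fillna(means)
print(filled)
```

The text column still has to be handled separately, for example with its most frequent value, if the model cannot accept missing entries.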
Upvotes: 7
Reputation: 126
You can try something like:
for col in df.columns:
    df[col] = df[col].fillna(df[col].mean())
But that is just one way to do it. Your code is almost correct. Your error is that you should write
train[value]
instead of
train['value']
everywhere in your code, because the latter looks for a column literally named "value", whereas value is the variable holding the column name you are iterating over.
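Note that the quoting is probably not the only change the function needs: an `if` on the result of .isnull() raises a ValueError because the truth value of a Series is ambiguous, and train.apply(replace, axis=1) calls the function once per row rather than once per column. A minimal sketch of a corrected loop, assuming a small hypothetical train frame with numeric columns only (the name fill_with_column_means is made up for illustration):

```python
import numpy as np
import pandas as pd

def fill_with_column_means(frame):
    """Return a copy of `frame` with NaN in each column replaced by that column's mean."""
    result = frame.copy()
    for col in result.columns:          # iterate over the column labels
        if result[col].isnull().any():  # .any() reduces the boolean Series to one bool
            result[col] = result[col].fillna(result[col].mean())
    return result

# Hypothetical stand-in for the asker's `train` frame.
train = pd.DataFrame({"A": [1, 2, np.nan], "B": [2, np.nan, np.nan]})
train = fill_with_column_means(train)
print(train)
```

The function is called once on the whole frame, so no apply is involved.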
Upvotes: 4
Reputation: 27899
To fill the NaN values of each column with its respective mean, use:
df.apply(lambda x: x.fillna(x.mean()))
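A short usage sketch on a hypothetical two-column frame. With the default axis=0, apply passes each column to the lambda as a Series, and the call returns a new DataFrame rather than modifying df, so the result has to be assigned to a name:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with missing values in both columns.
df = pd.DataFrame({"A": [1, 2, np.nan], "B": [2, np.nan, np.nan]})

# Each column x is a Series; its NaN values are filled with its own mean.
filled = df.apply(lambda x: x.fillna(x.mean()))
print(filled)

# The original frame still contains its NaN values.
print(df.isnull().sum())
```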
Upvotes: 6