Reputation: 35
I am working my way through the resource Python for Data Science For Dummies. I am currently learning about imputing missing data values using pandas. Below is my code:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values='NaN',
                    strategy='mean')
# Creates an imputer that replaces missing values.
# The missing_values parameter defines which value is treated as missing.
# The strategy parameter defines what value replaces each missing value.
# strategy can be either: mean, median, most_frequent
imp.fit([[1, 2, 3, 4, 5, 6, 7]])
'''
Before imputing, we must provide stats for the imputer to use by calling fit().
'''
s = [[1, 2, 3, np.NaN, 5, 6, None]]
print(imp.transform(s))
x = pd.Series(imp.transform(s).tolist()[0])  # .transform() fills in the missing values in s
# We want to display the result as a Series:
# .tolist() converts the imputer's output array to a nested list,
# and pd.Series() turns the first (and only) row of that list into a Series.
print(x)
However, when I run the code, it raises a ValueError at the line with imp.fit():
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-3b624663cf89> in <module>
15 # strategy can be either: mean, median, most_frequent
16
---> 17 imp.fit([[1, 2, 3, 4, 5, 6, 7]])
18 '''
19 Before imputing, we must provide stats for the imputer to use by calling fit().
/Applications/anaconda3/lib/python3.7/site-packages/sklearn/impute/_base.py in fit(self, X, y)
266 self : SimpleImputer
267 """
--> 268 X = self._validate_input(X)
269 super()._fit_indicator(X)
270
/Applications/anaconda3/lib/python3.7/site-packages/sklearn/impute/_base.py in _validate_input(self, X)
242 raise ve
243
--> 244 _check_inputs_dtype(X, self.missing_values)
245 if X.dtype.kind not in ("i", "u", "f", "O"):
246 raise ValueError("SimpleImputer does not support data with dtype "
/Applications/anaconda3/lib/python3.7/site-packages/sklearn/impute/_base.py in _check_inputs_dtype(X, missing_values)
26 " both numerical. Got X.dtype={} and "
27 " type(missing_values)={}."
---> 28 .format(X.dtype, type(missing_values)))
29
30
ValueError: 'X' and 'missing_values' types are expected to be both numerical. Got X.dtype=float64 and type(missing_values)=<class 'str'>.
Any help on the matter is greatly appreciated!
Also, wherever you are I hope that you are coping well with the COVID-19 situation!
Upvotes: 1
Views: 1598
Reputation: 17322
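Your missing_values parameter is set to the string 'NaN'. As the ValueError says, SimpleImputer expects missing_values and the data to be of the same kind, and your data is numeric (float64). Pass the float NaN instead of the string:
missing_values = np.nan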
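For reference, here is a minimal sketch of the question's code with that one change applied. Two things in it are my own assumptions rather than part of the original: the sample row uses np.nan for both gaps (instead of np.NaN and None, since the np.NaN alias was removed in NumPy 2.0), and the intermediate name filled is introduced only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Use the float np.nan as the missing-value marker, not the string 'NaN'.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() learns one statistic per column. With a single row of seven
# columns, each column's mean is simply the value in that column.
imp.fit([[1, 2, 3, 4, 5, 6, 7]])

# Sample row with two missing entries (assumption: np.nan for both gaps).
s = [[1, 2, 3, np.nan, 5, 6, np.nan]]

# transform() returns a 2-D array; take its only row to build the Series.
filled = imp.transform(s)
x = pd.Series(filled[0])
print(x)
```

With this fit data, the gaps in the fourth and seventh columns should be filled with the means learned for those columns (4 and 7).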
Upvotes: 1