Reputation: 221
I want to impute a dataframe column called Bare Nuclei with the median, and I got this error: ('must be str, not int', 'occurred at index Bare Nuclei'). The following code shows the unique values of the column data['Bare Nuclei']:
data['Bare Nuclei'].unique()
array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'],
dtype=object)
Then I tried to replace ? with nan and then impute nan with the median, but I got the above error:
data['Bare Nuclei'] = data['Bare Nuclei'].replace('?',np.nan)
#data['Bare Nuclei'].fillna()
data.apply(lambda x: x.fillna(x.mean()),axis=0)
The data is available at this link: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
Upvotes: 2
Views: 6588
Reputation: 11
If you want a more detailed and realistic median imputation, try this function, which fills each missing value with the median of its group:
def groupby_median_imputer(data, features_array, *args):
    # *args: any number of column names to group by
    from tqdm import tqdm
    print("Number of missing values remaining in each column:")
    for i in tqdm(features_array):
        # transform keeps the original index, so the result aligns with data
        data[i] = data.groupby([*args])[i].transform(lambda x: x.fillna(x.median()))
        print(i + " : " + str(data[i].isnull().sum()))
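As a rough illustration of the group-wise idea, here is a minimal sketch on toy data. The column names "group" and "value" are hypothetical and are not part of the breast cancer dataset; it only assumes pandas and numpy are installed.

```python
import numpy as np
import pandas as pd

# Toy data; "group" and "value" are hypothetical column names.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# Fill each NaN with the median of its own group.
# transform returns a result aligned with the original index.
df["value"] = df.groupby("group")["value"].transform(
    lambda s: s.fillna(s.median())
)
```

Each missing entry is filled from the observed values of its own group only, so a NaN in group "a" is not influenced by the values in group "b".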
Upvotes: 0
Reputation: 221
This is the correction, and it works:
data['Bare Nuclei'] = data['Bare Nuclei'].replace('?',np.nan).astype(float)
data['Bare Nuclei'] = data['Bare Nuclei'].fillna((data['Bare Nuclei'].median()))
Upvotes: 0
Reputation: 4855
The error occurs because the values in the 'Bare Nuclei' column are stored as strings, but the mean() function requires numbers. You can see that they are strings in the result of your call to .unique().
After replacing the '?' characters, you can convert the series to numbers using .astype(float):
data['Bare Nuclei'] = data['Bare Nuclei'].replace('?',np.nan)
data['Bare Nuclei'] = data['Bare Nuclei'].astype(float)
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(data['Bare Nuclei'].mean())
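An alternative worth considering is pd.to_numeric with errors="coerce", which converts the strings and turns '?' into NaN in one step. The sketch below uses a small made-up stand-in for the column rather than the real dataset, and fills with the median, as the question asked.

```python
import pandas as pd

# Small stand-in for the real column: digits stored as strings, '?' for missing.
data = pd.DataFrame({"Bare Nuclei": ["1", "10", "2", "?", "4", "?", "3"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
col = pd.to_numeric(data["Bare Nuclei"], errors="coerce")

# Fill the missing entries with the median of the observed values.
data["Bare Nuclei"] = col.fillna(col.median())
```

Note that coercion will silently turn any unparseable value into NaN, not only '?', so it may be worth checking .unique() first, as in the question.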
Upvotes: 1