Reputation: 367
I am using the breast-cancer-wisconsin dataset that looks as follows:
The Bare Nuclei column has 16 missing entries denoted by "?", which I replace with NaN as follows:
df.replace('?', np.nan, regex=False, inplace=True)
resulting in this (a few of the 16 missing entries):
I want to replace the NANs with the most frequently occurring value with respect to each class. To elaborate, the most frequently occurring value in column 'Bare Nuclei' which has class=2 (benign cancer) should be used to replace all the rows that have 'Bare Nuclei' == NAN and Class == 2. Similarly for class = 4 (malignant).
I tried the following:
df[df['Class']== 2]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==2]['Bare Nuclei'].mode(), inplace=True)
df[df['Class']== 4]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==4]['Bare Nuclei'].mode(), inplace=True)
It did not result in any error but when I tried this:
df.isnull().any()
Bare Nuclei shows True which means the NAN values are still there.
(column "Bare Nuclei" is of type object)
I don't understand what I am doing wrong. Please help! Thank you.
Upvotes: 1
Views: 2379
Reputation: 1
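import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
file.info()
# Mark the '?' placeholders as missing, then drop those rows
file.loc[file['Bare Nuclei'] == '?', 'Bare Nuclei'] = np.nan
file.dropna(inplace=True)
file.drop(['Sample code number'], axis=1, inplace=True)
file['Bare Nuclei'] = file['Bare Nuclei'].astype(int)
# Features and target are built once, outside the loop
X = file.drop(['Class', 'Bare Nuclei'], axis=1)
Y = file['Class'].values
num_split = 10  # assumed value; the number of repeated splits was not given
Sk_overall = 0
for i in range(num_split):
    # Vary the seed so each iteration evaluates a different split
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=i)
    classifier = LogisticRegression(max_iter=200, solver='newton-cg')
    classifier.fit(x_train, y_train)
    Sk_overall = Sk_overall + classifier.score(x_test, y_test)
# Average over the number of splits, not the last loop index
Sk_Accuracy = Sk_overall / num_split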
Upvotes: 0
Reputation: 950
As a late answer, if you want to replace every NaN in the "Bare Nuclei" column with the value from the "Class" column of the same row:
selection_condition = pd.isna(df["Bare Nuclei"])
df.loc[selection_condition, "Bare Nuclei"] = df.loc[selection_condition, "Class"]
If you want the replacement to be class-specific:
selection_condition = pd.isna(df["Bare Nuclei"]) & (df["Class"] == 2)
df.loc[selection_condition, "Bare Nuclei"] = df.loc[selection_condition, "Class"]
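The same boolean-mask idea can be pointed at the per-class mode that the question asks for, instead of the "Class" value itself. This is only a minimal sketch on made-up toy data; the names in_class and fill_value are illustrative and the column is assumed to be numeric already:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
df = pd.DataFrame({
    "Class": [2, 2, 2, 4, 4, 4],
    "Bare Nuclei": [1.0, 1.0, np.nan, 8.0, 8.0, np.nan],
})

for cls in df["Class"].unique():
    in_class = df["Class"] == cls
    # mode() ignores NaN by default; take the first value if there are ties
    fill_value = df.loc[in_class, "Bare Nuclei"].mode().iat[0]
    # Write only into the rows of this class that are missing
    df.loc[in_class & df["Bare Nuclei"].isna(), "Bare Nuclei"] = fill_value
```

Assigning through a single .loc call writes into df itself rather than into a temporary copy.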
Upvotes: 0
Reputation: 24322
You can try via groupby() + agg() + fillna():
s=df_vals.groupby('Class')['Bare Nuclei'].agg(lambda x:x.mode().iat[0])
df['Bare Nuclei']=df['Bare Nuclei'].fillna(df['Class'].map(s))
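To see the groupby approach end to end, here is a small sketch on made-up toy data. The name class_mode is illustrative, and the column is assumed to be numeric already (for example after converting the object column):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
df = pd.DataFrame({
    "Class": [2, 2, 2, 2, 4, 4, 4, 4],
    "Bare Nuclei": [1, 1, 3, np.nan, 10, 10, 5, np.nan],
})

# Most frequent non-missing value per class; mode() skips NaN by default
class_mode = df.groupby("Class")["Bare Nuclei"].agg(lambda x: x.mode().iat[0])

# Look up each row's class mode and use it only where the value is missing
df["Bare Nuclei"] = df["Bare Nuclei"].fillna(df["Class"].map(class_mode))
```

Because df["Class"].map(class_mode) has the same index as df, fillna can align it row by row.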
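OR
with your approach, use loc and assign the result back (fillna with inplace=True on a selection only changes a copy), passing the mode as a scalar:
df.loc[df['Class']==2,'Bare Nuclei']=df.loc[df['Class']==2,'Bare Nuclei'].fillna(df_vals.loc[df_vals['Class']==2,'Bare Nuclei'].mode().iat[0])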
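As for why the attempt in the question appears to do nothing, two things are probably at play: boolean indexing hands back a copy, and mode() returns a Series whose index (0, 1, ...) rarely lines up with the rows that are missing. A small sketch on made-up toy data (all variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
df = pd.DataFrame({
    "Class": [2, 2, 2, 4],
    "Bare Nuclei": [1.0, 1.0, np.nan, 7.0],
})

# Boolean indexing returns a copy, so filling it leaves df untouched
subset = df[df["Class"] == 2]["Bare Nuclei"]
filled_copy = subset.fillna(1.0)
missing_after_copy_fill = int(df["Bare Nuclei"].isna().sum())

# mode() returns a Series indexed 0, 1, ...; fillna aligns on index labels,
# so the NaN at label 2 finds no matching label and stays missing
mode_series = subset.mode()
still_missing = int(subset.fillna(mode_series).isna().sum())

# Assigning through a single .loc call with a scalar fill value does write to df
mask = df["Class"] == 2
df.loc[mask, "Bare Nuclei"] = df.loc[mask, "Bare Nuclei"].fillna(mode_series.iat[0])
missing_after_loc = int(df["Bare Nuclei"].isna().sum())
```

Here the first two counts stay at 1 and only the .loc assignment brings the number of missing values down to 0.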
Upvotes: 2