Reputation: 367

How to replace NAN values based on the values in another column in pandas

I am using the breast-cancer-wisconsin dataset that looks as follows:

The Bare Nuclei column has 16 missing entries denoted by "?" which I replace with NAN as follows:

df.replace('?', np.nan, regex=False, inplace=True)

resulting in this (a few of the 16 missing entries):

I want to replace the NANs with the most frequently occurring value with respect to each class. To elaborate, the most frequently occurring value in column 'Bare Nuclei' which has class=2 (benign cancer) should be used to replace all the rows that have 'Bare Nuclei' == NAN and Class == 2. Similarly for class = 4 (malignant).

I tried the following:

df[df['Class']== 2]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==2]['Bare Nuclei'].mode(), inplace=True)

df[df['Class']== 4]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==4]['Bare Nuclei'].mode(), inplace=True)

It did not result in any error but when I tried this:

df.isnull().any()

Bare Nuclei shows True which means the NAN values are still there.

(column "Bare Nuclei" is of type object)

I don't understand what I am doing wrong. Please help! Thank you.

Upvotes: 1

Answers (3)

William Giddens

Reputation: 1

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

file.info()
# Mark the "?" placeholders as missing, then drop those rows
file.loc[file['Bare Nuclei'] == '?', 'Bare Nuclei'] = np.nan

file.dropna(inplace=True)
file.drop(['Sample code number'], axis=1, inplace=True)
file['Bare Nuclei'] = file['Bare Nuclei'].astype(int)

num_split = 10  # assumed value; the original does not define it
Sk_overall = 0.0
for i in range(num_split):
    X = file.drop(['Class', 'Bare Nuclei'], axis=1)
    Y = file['Class'].values
    # Vary the seed so that each iteration uses a different split
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=i)
    classifier = LogisticRegression(max_iter=200, solver='newton-cg')
    classifier.fit(x_train, y_train)
    Sk_overall = Sk_overall + classifier.score(x_test, y_test)
# Average accuracy over all splits
Sk_Accuracy = Sk_overall / num_split

Upvotes: 0

user2583808

Reputation: 950

As a late answer, if you want to replace every NaN you have in the "Bare Nuclei" column by the values in the column "Class":

selection_condition = pd.isna(df["Bare Nuclei"])
df.loc[selection_condition, "Bare Nuclei"] = df.loc[selection_condition, "Class"]

If you want the replacement to be class-specific:

selection_condition = pd.isna(df["Bare Nuclei"]) & (df["Class"] == 2)
df.loc[selection_condition, "Bare Nuclei"] = df.loc[selection_condition, "Class"]
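The same boolean-mask pattern can also write the per-class mode that the question asks for, rather than the class label. The following is only a sketch on made-up toy data (the values are not from the real dataset), looping over the classes and assigning through a single .loc call:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration)
df = pd.DataFrame({
    "Class": [2, 2, 2, 4, 4, 4],
    "Bare Nuclei": ["1", "1", np.nan, "10", np.nan, "10"],
})

for cls in df["Class"].unique():
    in_class = df["Class"] == cls
    # mode() ignores NaN by default; take the first value if there is a tie
    most_common = df.loc[in_class, "Bare Nuclei"].mode().iat[0]
    # Write only to rows of this class whose value is missing
    df.loc[in_class & df["Bare Nuclei"].isna(), "Bare Nuclei"] = most_common
```

Note that .mode().iat[0] would raise an IndexError for a class whose values are all missing, so that case would need separate handling.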

Upvotes: 0

Anurag Dabas

Reputation: 24322

You can try via groupby()+agg()+fillna():

s = df_vals.groupby('Class')['Bare Nuclei'].agg(lambda x: x.mode().iat[0])
df['Bare Nuclei'] = df['Bare Nuclei'].fillna(df['Class'].map(s))
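To see what the groupby and map steps produce, here is a small self-contained sketch. The toy frame and the names class_mode and filled are assumptions made for illustration; only the 'Class' and 'Bare Nuclei' column names come from the question:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration)
df = pd.DataFrame({
    "Class": [2, 2, 2, 2, 4, 4, 4, 4],
    "Bare Nuclei": ["1", "1", np.nan, "3", "10", np.nan, "10", "5"],
})

# Most frequent non-missing value per class; mode() drops NaN by default,
# and .iat[0] takes the first value when several are tied
class_mode = df.groupby("Class")["Bare Nuclei"].agg(lambda x: x.mode().iat[0])

# Map each row's class to that class's mode, then fill only the missing entries
filled = df["Bare Nuclei"].fillna(df["Class"].map(class_mode))
```

Because fillna aligns on the index, rows that already have a value are left alone and each missing row receives the mode of its own class.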

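Following your approach, use loc and assign the result back (repeat with 4 for the malignant class):

mask = df['Class'] == 2
df.loc[mask, 'Bare Nuclei'] = df.loc[mask, 'Bare Nuclei'].fillna(df_vals.loc[df_vals['Class'] == 2, 'Bare Nuclei'].mode().iat[0])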
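As for why the attempt in the question leaves the NaNs in place: df[df['Class'] == 2] builds a new object, so an in-place fillna on it most likely never reaches df. A minimal sketch on invented toy data, with the fill value "1" chosen arbitrarily:

```python
import warnings

import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration)
df = pd.DataFrame({
    "Class": [2, 2, 2, 4, 4],
    "Bare Nuclei": ["1", np.nan, "1", "10", np.nan],
})

# Boolean indexing returns a new object, so the in-place fill acts on a
# temporary copy; pandas may warn about this, which is silenced here
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df["Class"] == 2]["Bare Nuclei"].fillna("1", inplace=True)
still_missing = df["Bare Nuclei"].isna().sum()

# Assigning through a single .loc call writes to df itself
mask = df["Class"] == 2
df.loc[mask, "Bare Nuclei"] = df.loc[mask, "Bare Nuclei"].fillna("1")
missing_after = df["Bare Nuclei"].isna().sum()
```

In this sketch the chained version leaves both missing values untouched, while the .loc assignment fills the one belonging to class 2.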

Upvotes: 2
