Reputation: 65
I have a dataframe column with some numeric values. I want these values to be replaced by 1 and 0 based on a condition: if a value is above the mean of the column, change it to 1; otherwise set it to 0.
Here is the code I have now:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('data.csv')
dataset = dataset.dropna(axis=0, how='any')
X = dataset.drop(['myCol'], axis=1)
y = dataset.iloc[:, 4:5].values
mean_y = np.mean(dataset.myCol)
The target is y, which looks like this:
0
0 16
1 13
2 12.5
3 12
and so on. mean_y is equal to 3.55. Therefore, I need all values greater than 3.55 to become 1 and the rest to become 0.
I applied this loop, but without success:
for i in dataset.myCol:
    if dataset.myCol[i] > mean_y:
        dataset.myCol[i] = 1
    else:
        dataset.myCol[i] = 0
The output is the following:
0
0 16
1 13
2 0
3 12
What am I doing wrong? Can someone please explain the mistake to me?
Thank you!
Upvotes: 4
Views: 10812
Reputation: 862481
Convert the boolean mask to integer, which maps True to 1 and False to 0:
print (dataset.myCol > mean_y)
0 True
1 False
2 False
3 False
Name: myCol, dtype: bool
dataset.myCol = (dataset.myCol > mean_y).astype(int)
print (dataset)
myCol
0 1
1 0
2 0
3 0
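For reference, here is a minimal self-contained sketch of the same idea. The four sample values are an assumption standing in for data.csv (they are the ones shown in the question), so the mean here is 13.375 rather than the 3.55 of the full file.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
dataset = pd.DataFrame({"myCol": [16.0, 13.0, 12.5, 12.0]})

# Compute the mean before the column is overwritten
mean_y = dataset["myCol"].mean()

# Comparing a Series with a scalar gives a boolean Series (the mask)
mask = dataset["myCol"] > mean_y

# Casting the mask to int turns True into 1 and False into 0
dataset["myCol"] = mask.astype(int)

print(dataset["myCol"].tolist())  # [1, 0, 0, 0]
```

Note that mean_y has to be computed before the assignment, otherwise it would be the mean of the new 0/1 values.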
Your loop approach is not recommended because it is slow, but if you want to keep it, use iterrows to get the index labels and loc to set values by index label and column name:
for i, x in dataset.iterrows():
    if dataset.loc[i, 'myCol'] > mean_y:
        dataset.loc[i, 'myCol'] = 1
    else:
        dataset.loc[i, 'myCol'] = 0
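As for what likely went wrong in the original loop: iterating over a Series yields its values, not its index labels, so in `for i in dataset.myCol` the variable i is 16, 13, 12.5 and so on, and `dataset.myCol[i]` then looks up whichever row happens to carry that label. The sketch below illustrates this on a small stand-in Series (the sample values are an assumption taken from the question) and shows a loop over label/value pairs instead.

```python
import pandas as pd

# Hypothetical sample data; the real column comes from data.csv
s = pd.Series([16.0, 13.0, 12.5, 12.0], name="myCol")

# Iterating a Series gives the values, not the row labels
values_seen = [v for v in s]      # [16.0, 13.0, 12.5, 12.0]
labels_seen = list(s.index)       # [0, 1, 2, 3]

# items() yields (label, value) pairs, which is what the loop needs
mean_y = s.mean()
result = s.copy()
for label, value in s.items():
    result.loc[label] = 1 if value > mean_y else 0

# The Series was float, so cast to int for clean 0/1 values
result = result.astype(int)
print(result.tolist())  # [1, 0, 0, 0]
```

Writing through loc on a single object also avoids the chained assignment in `dataset.myCol[i] = ...`, which pandas does not guarantee will write back to the dataframe.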
Upvotes: 2
Reputation: 210832
Try this vectorized approach:
dataset.myCol = np.where(dataset.myCol > dataset.myCol.mean(), 1, 0)
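A small runnable sketch of this, again on assumed sample values rather than the real data.csv, which also checks that np.where gives the same 0/1 result as casting the boolean mask to int:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv
dataset = pd.DataFrame({"myCol": [16.0, 13.0, 12.5, 12.0]})
col = dataset["myCol"]

# np.where(condition, a, b) picks a where condition is True, else b
via_where = np.where(col > col.mean(), 1, 0)

# Equivalent result from casting the boolean mask
via_astype = (col > col.mean()).astype(int)

dataset["myCol"] = via_where
print(dataset["myCol"].tolist())  # [1, 0, 0, 0]
```

np.where is the more general choice when the two replacement values are something other than 1 and 0.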
Upvotes: 6