I am learning machine learning and generated a pandas DataFrame with the columns Id, Category, Cost_price and Sold. The shape of the DataFrame is (100000, 4).
Here the target variable is the Sold column (1 = sold, 0 = not sold). But no machine learning algorithm gets good accuracy, because the values in all the columns are random. To introduce a pattern into the DataFrame, I am trying to manipulate some of the values in the Sold column.
What I want to do is set 6000 of the Sold values to 1 in rows where cost_price is less than 800, but I have not been able to do that.
I am new to machine learning and Python. Please help me.
Thanks in advance.
Upvotes: 1
Views: 68
Reputation: 4233
IIUC, use DataFrame.loc (DataFrame.at only accepts a single scalar label, so it cannot take an index of many rows):
df.loc[df.index[df.cost_price < 800][:6000], 'Sold'] = 1
To choose the rows at random instead, use .sample:
df.loc[df[df.cost_price < 800].sample(6000).index, 'Sold'] = 1
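A minimal, self-contained sketch of both variants follows. The toy frame, the seed, and `n = 50` are assumptions standing in for the asker's 100000-row data and the requested 6000 rows; `.loc` is used for the assignment because it accepts a whole list of row labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the asker's data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cost_price": rng.integers(100, 2000, size=1000),
    "Sold": rng.integers(0, 2, size=1000),
})

n = 50  # stand-in for 6000
cheap = df[df["cost_price"] < 800]

# Variant 1: the first n matching rows, in frame order.
first_idx = cheap.index[:n]
df_first = df.copy()
df_first.loc[first_idx, "Sold"] = 1

# Variant 2: n matching rows drawn at random without replacement.
sample_idx = cheap.sample(n, random_state=0).index
df_random = df.copy()
df_random.loc[sample_idx, "Sold"] = 1
```

Note that `.sample(n)` raises an error if fewer than `n` rows satisfy the condition, so it may be worth checking `len(cheap)` first.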
Upvotes: 0
Reputation: 863801
Use:
df.loc[np.random.choice(df.index[df['cost_price'] < 800], 6000, replace=False), 'Sold'] = 1
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sold': [1, 0, 0, 1, 1, 0] * 3,
    'cost_price': [500, 300, 6000, 900, 100, 400] * 3,
})
print(df)
    Sold  cost_price
0      1         500
1      0         300
2      0        6000
3      1         900
4      1         100
5      0         400
6      1         500
7      0         300
8      0        6000
9      1         900
10     1         100
11     0         400
12     1         500
13     0         300
14     0        6000
15     1         900
16     1         100
17     0         400
df.loc[np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False), 'Sold'] = 1
print (df)
    Sold  cost_price
0      1         500
1      1         300
2      0        6000
3      1         900
4      1         100
5      1         400
6      1         500
7      1         300
8      0        6000
9      1         900
10     1         100
11     1         400
12     1         500
13     1         300
14     0        6000
15     1         900
16     1         100
17     1         400
Explanation:
First, filter the index values by the condition using boolean indexing:
print (df.index[df['cost_price'] < 800])
Int64Index([0, 1, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17], dtype='int64')
Then select N random values with numpy.random.choice:
print (np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False))
[16 1 7 13 17 12 10 6 5 11]
Finally, set 1 at those index values with DataFrame.loc.
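The three steps can also be sketched with NumPy's seeded `Generator` API so the draw is repeatable. Two things here are assumptions and not part of the answer above: the seed value, and the extra `Sold == 0` filter, which restricts the candidates to rows that are not yet sold so that exactly `n` values actually change.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sold": [1, 0, 0, 1, 1, 0] * 3,
    "cost_price": [500, 300, 6000, 900, 100, 400] * 3,
})

rng = np.random.default_rng(42)  # seeded so the draw is repeatable

# Step 1: labels of rows that are cheap AND not yet sold (assumed extra filter).
mask = (df["cost_price"] < 800) & (df["Sold"] == 0)
candidates = df.index[mask]

# Step 2: draw n labels without replacement (n must not exceed len(candidates)).
n = 4
chosen = rng.choice(candidates, size=n, replace=False)

# Step 3: assign by label with .loc.
before = int(df["Sold"].sum())
df.loc[chosen, "Sold"] = 1
after = int(df["Sold"].sum())
print(after - before)  # 4
```

Without the `Sold == 0` filter, some of the drawn rows may already be 1, so fewer than `n` values would flip.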
Upvotes: 1
Reputation: 2668
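I will assume you want to choose those 6000 rows at random.
import random
# Collect the row labels (not the Sold values) where the price is below 800
idx = df.index[df.Cost_price < 800].tolist()
r = random.sample(idx, 6000)
# Assign through df.loc to avoid chained assignment
df.loc[r, 'Sold'] = 1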
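For reference, a small runnable sketch of this standard-library approach. The toy frame, the seed, and the sample size of 5 are assumptions for illustration. It samples from the index labels of the matching rows and writes through `df.loc`, since `random.sample` needs row labels to be useful for the assignment.

```python
import random

import pandas as pd

# Hypothetical toy frame; every row starts as not sold.
df = pd.DataFrame({
    "Cost_price": [500, 300, 6000, 900, 100, 400] * 3,
    "Sold": [0] * 18,
})

random.seed(0)  # assumed seed, only for repeatability

# Row labels where the price is below 800.
idx = df.index[df["Cost_price"] < 800].tolist()

# Pick 5 distinct labels; random.sample raises ValueError if len(idx) < 5.
r = random.sample(idx, 5)

# Label-based assignment on the frame itself.
df.loc[r, "Sold"] = 1
```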
Upvotes: 0