user8776508

Changing values of a pandas dataframe based on other values in the dataframe

I am learning machine learning and generated a pandas DataFrame with the columns Id, Category, Cost_price, and Sold. The shape of the DataFrame is (100000, 4).

Here the target variable is the Sold column (1 = sold, 0 = not sold). But no machine learning algorithm reaches good accuracy, because all the columns in the DataFrame are random. To introduce a pattern, I am trying to manipulate some of the values in the Sold column.

What I want to do is change 6000 of the Sold values to 1 where cost_price is less than 800, but I have not been able to do that.

I am new to machine learning and Python. Please help me.

Thanks in advance

Upvotes: 1

Views: 68

Answers (3)

Abhi

Reputation: 4233

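IIUC, use DataFrame.loc (DataFrame.at is meant for a single scalar value and may fail on a list of labels in recent pandas versions):

df.loc[df.Sold[df.cost_price < 800][:6000].index, 'Sold'] = 1

To choose the rows at random, use .sample:

df.loc[df[df.cost_price < 800].sample(6000).index, 'Sold'] = 1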
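
A minimal sketch of the random-selection variant on a small made-up frame. The data, the value of n (a stand-in for 6000), and the random_state seed are assumptions for illustration. The assignment goes through DataFrame.loc, which accepts a list of index labels, and n is capped so that sample is never asked for more rows than match the condition.

```python
import pandas as pd

# Toy frame; column names follow the question, values are made up.
df = pd.DataFrame({
    "cost_price": [500, 300, 6000, 900, 100, 400] * 3,
    "Sold": [0] * 18,
})

n = 5  # stand-in for 6000 in the question
candidates = df.index[df["cost_price"] < 800]

# Guard against asking for more rows than match the condition.
n = min(n, len(candidates))

# random_state makes the draw reproducible between runs.
chosen = df.loc[candidates].sample(n=n, random_state=0).index
df.loc[chosen, "Sold"] = 1
```

Without the cap, sample raises an error when fewer than n rows have cost_price below 800.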

Upvotes: 0

jezrael

Reputation: 863801

Use:

df.loc[np.random.choice(df.index[df['cost_price'] < 800], 6000, replace=False), 'Sold'] = 1

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sold': [1,0,0,1,1,0] * 3,
    'cost_price': [500,300,6000,900,100,400] * 3,
})
print (df)
    Sold  cost_price
0      1         500
1      0         300
2      0        6000
3      1         900
4      1         100
5      0         400
6      1         500
7      0         300
8      0        6000
9      1         900
10     1         100
11     0         400
12     1         500
13     0         300
14     0        6000
15     1         900
16     1         100
17     0         400

df.loc[np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False), 'Sold'] = 1
print (df)
    Sold  cost_price
0      1         500
1      1         300
2      0        6000
3      1         900
4      1         100
5      1         400
6      1         500
7      1         300
8      0        6000
9      1         900
10     1         100
11     1         400
12     1         500
13     1         300
14     0        6000
15     1         900
16     1         100
17     1         400

Explanation:

First filter index values by condition with boolean indexing:

print (df.index[df['cost_price'] < 800])
Int64Index([0, 1, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17], dtype='int64')

Then select N random values without replacement with numpy.random.choice:

print (np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False))
[16  1  7 13 17 12 10  6  5 11]

Finally, set Sold to 1 for the chosen index values with DataFrame.loc.
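
In the sample above, some of the picked rows already had Sold equal to 1, so fewer than N values actually changed. If exactly N values should flip from 0 to 1, one option is to also require Sold == 0 when building the candidate index. The sketch below is an assumption-laden variant, not part of the original answer: it uses the same toy frame, NumPy's Generator API (np.random.default_rng) with an arbitrary seed, and n = 4 as a stand-in for 6000.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sold": [1, 0, 0, 1, 1, 0] * 3,
    "cost_price": [500, 300, 6000, 900, 100, 400] * 3,
})

rng = np.random.default_rng(42)  # seeded generator for reproducibility

# Candidate labels: cheap rows that are not yet marked as sold.
eligible = df.index[(df["cost_price"] < 800) & (df["Sold"] == 0)].to_numpy()

n = 4  # stand-in for 6000; must not exceed len(eligible)
picked = rng.choice(eligible, size=n, replace=False)

sold_before = int(df["Sold"].sum())
df.loc[picked, "Sold"] = 1
sold_after = int(df["Sold"].sum())

print(sold_before, sold_after)  # the count rises by exactly n
```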

Upvotes: 1

ipramusinto

Reputation: 2668

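I will assume you want to choose those 6000 rows at random.

import random
# Take the index labels, not the Sold values, of rows with Cost_price < 800.
idx = df.index[df.Cost_price < 800].tolist()
r = random.sample(idx, 6000)
df.loc[r, 'Sold'] = 1  # df.loc avoids chained assignment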
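
To check that the change introduced the intended pattern, the share of sold rows can be compared on both sides of the price threshold. This is a sketch on made-up, deterministic data; the frame, the sample size of 300, and the seed are assumptions for illustration.

```python
import random

import pandas as pd

# Deterministic toy data: 1000 rows, Sold alternates 0/1 (50% sold everywhere).
df = pd.DataFrame({
    "Cost_price": list(range(0, 2000, 2)),
    "Sold": [0, 1] * 500,
})

cheap = df["Cost_price"] < 800          # boolean mask, 400 rows are True
idx = df.index[cheap].tolist()          # index labels of the cheap rows

r = random.Random(0).sample(idx, 300)   # seeded draw without replacement
df.loc[r, "Sold"] = 1

# Share of sold rows below and above the threshold after the change.
cheap_rate = df.loc[cheap, "Sold"].mean()
other_rate = df.loc[~cheap, "Sold"].mean()
print(cheap_rate, other_rate)
```

A clearly higher rate for the cheap rows indicates that Sold now depends on Cost_price, which is the signal a classifier can pick up.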

Upvotes: 0
