Reputation: 521
I have a dataframe with this kind of data:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328
0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 84 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 50 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The df shape is (10000, 329).
I would like to turn a random 5% of the 1s in the dataframe to 0.
Is this possible?
Upvotes: 1
Views: 1775
Reputation: 31
Here is a more long-winded solution where I print out the various steps.
Create the sample dataset with numpy. The dimensions and values are altered from the question to make the response clearer. rawmat is a 10 by 10 matrix of zeros and ones, except for the first column, which holds larger values. Among the zeros and ones, each cell has a 50 percent probability of being a one.
import numpy as np
np.random.seed(1000)
rawmat = np.random.randint(2,size=(10,10))
# insert higher values in the first column
rawmat[:,0] = np.random.randint(low=5,high=9,size=10)
print(rawmat)
[[5 1 1 0 1 0 0 1 1 0]
[6 1 0 1 0 1 0 0 1 1]
[5 0 1 0 0 0 0 1 0 0]
[6 0 0 0 1 0 0 1 1 0]
[6 0 1 1 0 1 0 1 0 0]
[5 1 0 0 1 0 0 1 0 1]
[5 1 1 0 1 0 1 0 1 1]
[5 1 1 1 1 1 1 0 1 1]
[7 1 1 1 0 0 0 0 1 1]
[8 0 0 0 1 1 0 1 1 0]]
Out of 100 cells, 90 are now zero or one. In fact 46 are 1, which is reasonable given the 50 per cent probability.
np.count_nonzero(rawmat==1)
46
We can create a mask, randmask, in which each cell is True with 50 percent probability. However, the trick in this question is to focus only on the ones, so we isolate them with rawones.
randmask = np.random.choice(a=[False, True], size=(10,10),p=[0.5,0.5])
rawones = np.where(rawmat==1,rawmat,0)
# keep a one only where the random mask is True; everything else becomes 0
onefin = np.where(randmask,rawones,np.zeros((10,10),dtype=int))
Now the number of ones drops by roughly half. Initially there were 46 ones in rawmat and now there are 23 in onefin.
np.count_nonzero(onefin==1)
23
The filtered ones can be recombined with the old data to get a matrix with half the ones.
finmat = np.where(rawmat==1,onefin,rawmat)
print(finmat)
[[5 0 0 0 0 0 0 1 1 0]
[6 1 0 1 0 0 0 0 1 0]
[5 0 1 0 0 0 0 1 0 0]
[6 0 0 0 0 0 0 1 0 0]
[6 0 0 1 0 0 0 0 0 0]
[5 1 0 0 1 0 0 0 0 1]
[5 0 1 0 1 0 1 0 0 0]
[5 1 0 0 1 1 0 0 1 1]
[7 1 1 0 0 0 0 0 0 0]
[8 0 0 0 0 1 0 0 0 0]]
Now we have the original matrix with the number of ones dropped by one half from 46 to 23.
np.count_nonzero(finmat==1)
23
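The walkthrough above uses a 50 percent mask for readability, and a random True/False mask only removes approximately that share of the ones. As a minimal sketch of an exact-count variant, the ones can be located by flat index and a fixed number of them sampled without replacement. The function name zero_out_fraction_of_ones and the demo matrix are hypothetical, introduced only for illustration; for the question's case frac would be 0.05.

```python
import numpy as np

def zero_out_fraction_of_ones(mat, frac, rng):
    """Return a copy of `mat` with round(frac * number_of_ones) of its 1s set to 0.

    Hypothetical helper for illustration; not part of numpy or pandas.
    """
    out = mat.copy()              # fresh C-contiguous copy, input stays untouched
    flat = out.ravel()            # view onto `out`, so writes land in `out`
    one_idx = np.flatnonzero(flat == 1)          # flat positions of every 1
    n_flip = int(round(frac * one_idx.size))     # how many 1s to turn off
    flip_idx = rng.choice(one_idx, size=n_flip, replace=False)
    flat[flip_idx] = 0
    return out

rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(10, 10))
demo[:, 0] = rng.integers(5, 9, size=10)   # larger values in the first column
result = zero_out_fraction_of_ones(demo, 0.5, rng)
```

Because only cells equal to 1 are candidates, the first column and the existing zeros are left as they were.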
Upvotes: 0
Reputation: 93161
Try this:
import numpy as np
# Get all columns from 1 to 328 and stack them into a temp series
tmp = df.loc[:, 1:].stack()
# Get the 1s
ones = tmp[tmp == 1].values.astype('int8')
# Mix with 5% zeros. You can use ceil or floor here
# as long as it makes an integer
n_zero = np.ceil(ones.shape[0] * .05).astype('int')
# Make the 0s
zeros = np.zeros(n_zero, dtype='int8')
# Replace 5% of the 1s with 0s and shuffle them
noise = np.concatenate((ones[n_zero:], zeros))
np.random.shuffle(noise)
# Assign the noise back to `tmp`
tmp.loc[tmp == 1] = noise
# Assign the noise back to the original frame
df.loc[:, 1:] = tmp.unstack()
You can tell whether 5% of the 1s have been replaced with 0s by summing the frame before and after:
# Run this before and after the last line above to verify
df.loc[:, 1:].values.sum()
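To make the before/after check concrete, here is a small self-contained sketch. It assumes a hypothetical 200 by 20 stand-in frame rather than the question's data, and it applies the same shuffle idea directly to the underlying numpy values instead of going through stack/unstack, so treat it as an illustration of the check rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the question's frame: column 0 holds larger
# values, the remaining columns hold 0/1 flags.
df = pd.DataFrame(rng.integers(0, 2, size=(200, 20)))
df[0] = rng.integers(10, 100, size=200)

vals = df.loc[:, 1:].to_numpy().copy()
ones_before = int((vals == 1).sum())

# Number of 1s to replace (ceil, as in the answer above)
n_zero = int(np.ceil(ones_before * 0.05))

# Same count of entries as there are 1s, with n_zero of them set to 0,
# shuffled so the zeros land at random positions
noise = np.concatenate((
    np.ones(ones_before - n_zero, dtype=vals.dtype),
    np.zeros(n_zero, dtype=vals.dtype),
))
rng.shuffle(noise)

mask = vals == 1
vals[mask] = noise           # overwrite only the cells that held a 1
df.loc[:, 1:] = vals         # write the result back into the frame

ones_after = int((df.loc[:, 1:].to_numpy() == 1).sum())
print(ones_before, ones_after, ones_before - ones_after)
```

The difference between the two counts should equal n_zero, and column 0 is never touched.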
Upvotes: 1