Reputation: 3924
I have a dataset of ~3700 rows and need to remove 1628 of those rows based on the compliance column. The dataset looks like this:
compliance  day0  day1  day2  day3  day4
True           1     3     9     8     8
False          7     4     8     3     2
True           4     5     0     3     5
True           5     3     9     6     2
For 1068 rows, I want to remove the entire row if compliance is True.
The thing is, I want to do this randomly; I don't want to simply remove the first 1063 rows. I tried this:
for z in range(1629):
    rand = random.randint(0,(3783-z)) #subtract z since dataframe shape is shrinking
    if str(data.iloc[rand,1]) == 'True':
        data = data.drop(balanced_dataset.index[rand])
But I'm getting the following error, after it removes a few rows:
'labels [2359] not contained in axis'
I also tried this:
data.drop(data("adherence.str.startswith('T').values").sample(frac=.4).index)
frac is arbitrarily picked for now, I just wanted it to work. I got the following error:
'DataFrame' object is not callable
Any help would be greatly appreciated! Thank you
Upvotes: 1
Views: 5400
Reputation: 2032
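You can try the following, where fraction holds the number of rows to drop (sample's n expects an integer count; use frac= instead for a proportion):
df_dropped = df.drop(df.loc[df.compliance, :].sample(n=fraction).index)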
Upvotes: 0
Reputation: 402353
Use sample with drop:
n = 1068
# Do this first if you haven't already.
# df.compliance = df.compliance.map(pd.eval)
df_dropped = df.drop(df[df.compliance].sample(n=n).index)
For this to work, n must not exceed the number of rows in the filtered DataFrame.
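If n might exceed the number of compliant rows, sample raises a ValueError when sampling without replacement. A minimal sketch of one way to guard against that follows; the toy frame, the oversized n, and the n_safe name are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Toy frame mirroring the question's layout (values from its sample rows).
df = pd.DataFrame({
    "compliance": [True, False, True, True],
    "day0": [1, 7, 4, 5],
    "day1": [3, 4, 5, 3],
})

n = 5  # deliberately more than the 3 compliant rows available
compliant = df[df["compliance"]]

# Cap the request so sample() never asks for more rows than exist.
n_safe = min(n, len(compliant))
df_dropped = df.drop(compliant.sample(n=n_safe).index)
```

Whether silently capping n is acceptable depends on the use case; raising an explicit error may be preferable if removing exactly n rows matters.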
Example dropping two rows randomly:
df.drop(df[df.compliance].sample(n=2).index)
   compliance  day0  day1  day2  day3  day4
1       False     7     4     8     3     2
3        True     5     3     9     6     2
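For a repeatable result, sample accepts a random_state argument. The sketch below assumes the compliance column arrived as the strings "True"/"False" (the question's str(...) == 'True' check suggests this may be the case) and converts it to booleans with a plain comparison rather than pd.eval; the seed value and the removed variable are arbitrary choices for illustration.

```python
import pandas as pd

# The question's four sample rows, with compliance stored as text.
df = pd.DataFrame({
    "compliance": ["True", "False", "True", "True"],
    "day0": [1, 7, 4, 5],
    "day1": [3, 4, 5, 3],
    "day2": [9, 8, 0, 9],
    "day3": [8, 3, 3, 6],
    "day4": [8, 2, 5, 2],
})

# Turn the strings into real booleans so the column works as a mask.
df["compliance"] = df["compliance"] == "True"

# Pick 2 compliant rows at random; random_state makes the pick repeatable.
to_drop = df[df["compliance"]].sample(n=2, random_state=42).index
df_dropped = df.drop(to_drop)

# Sanity check: every removed row was a compliant one.
removed = df.loc[to_drop]
print(removed["compliance"].all())
```

Because the rows are chosen by label from the filtered frame and dropped in a single call, no label can be requested twice, which is likely what went wrong in the loop from the question.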
Upvotes: 5
Reputation: 1055
This worked for me:
Generate a list of the indices you want to remove rows from (in your case, the rows where compliance == True). Then randomly choose, without replacement, as many of those indices as you want removed.
Then drop them from the DataFrame:
import numpy as np

to_remove = np.random.choice(data[data['compliance'] == True].index, size=1068, replace=False)
data = data.drop(to_remove)  # drop returns a new DataFrame, so assign the result
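The same idea can be sketched with NumPy's Generator interface, which allows a seed for reproducibility. The toy frame, the seed, and the size of 2 are assumptions chosen so the example runs on its own; for the real dataset, size would be 1068.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the question's layout (values from its sample rows).
data = pd.DataFrame({
    "compliance": [True, False, True, True],
    "day0": [1, 7, 4, 5],
    "day1": [3, 4, 5, 3],
})

rng = np.random.default_rng(seed=0)

# Index labels of the rows that are eligible for removal.
candidates = data.index[data["compliance"]].to_numpy()

# Choose labels without replacement so no row is picked twice.
to_remove = rng.choice(candidates, size=2, replace=False)

# drop returns a new DataFrame, so the result is assigned back.
data = data.drop(to_remove)
```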
Upvotes: 2