V P

Reputation: 3924

How to remove random rows from pandas dataframe based on column entry?

I have a dataset of ~3700 rows and need to remove 1628 of those rows based on the column. The dataset looks like this:

compliance  day0  day1  day2  day3  day4
True        1     3     9     8     8
False       7     4     8     3     2
True        4     5     0     3     5
True        5     3     9     6     2

For 1068 rows, I want to remove the entire row if compliance is True.

The thing is, I want to do this randomly; I don't want to remove the first 1063 rows. I tried this:

for z in range(1629):
    rand = random.randint(0,(3783-z)) #subtract z since dataframe shape is shrinking
    if str(data.iloc[rand,1]) == 'True':
        data = data.drop(balanced_dataset.index[rand])

But I'm getting the following error, after it removes a few rows:

 'labels [2359] not contained in axis'

I also tried this:

data.drop(data("adherence.str.startswith('T').values").sample(frac=.4).index)

frac is arbitrarily picked for now, I just wanted it to work. I got the following error:

'DataFrame' object is not callable

Any help would be greatly appreciated! Thank you

Upvotes: 1

Views: 5400

Answers (3)

Rajat Jain

Reputation: 2032

You can try:

df_dropped = df.drop(df.loc[df.compliance, :].sample(n=fraction).index)

Upvotes: 0

cs95

Reputation: 402353

Use sample with drop:

n = 1068
# If compliance holds the strings 'True'/'False' rather than booleans, convert it first.
# df.compliance = df.compliance.map(pd.eval)
df_dropped = df.drop(df[df.compliance].sample(n=n).index)

For this to work, n must be no larger than the number of rows in the filtered DataFrame, because sample draws without replacement by default.


Example dropping two rows randomly.

df.drop(df[df.compliance].sample(n=2).index)

   compliance  day0  day1  day2  day3  day4
1       False     7     4     8     3     2
3        True     5     3     9     6     2
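A minimal, self-contained sketch of the same sample-then-drop idea, using the four rows from the question. The helper name drop_random_compliant and the seed argument are hypothetical, added here only for illustration; the size check makes the constraint on n explicit.

```python
import pandas as pd

# Toy data mirroring the four-row example in the question.
df = pd.DataFrame({
    "compliance": [True, False, True, True],
    "day0": [1, 7, 4, 5],
    "day1": [3, 4, 5, 3],
    "day2": [9, 8, 0, 9],
    "day3": [8, 3, 3, 6],
    "day4": [8, 2, 5, 2],
})


def drop_random_compliant(frame, n, seed=None):
    """Hypothetical helper: drop n randomly chosen rows where compliance is True."""
    candidates = frame[frame["compliance"]]
    if n > len(candidates):
        raise ValueError(
            f"asked to drop {n} rows but only {len(candidates)} have compliance=True"
        )
    # sample() draws without replacement by default, so no row is picked twice.
    return frame.drop(candidates.sample(n=n, random_state=seed).index)


df_dropped = drop_random_compliant(df, n=2, seed=0)
print(df_dropped)
```

Passing a seed makes the selection repeatable between runs; leave it as None to get a different random subset each time.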

Upvotes: 5

Tacratis

Reputation: 1055

This worked for me: generate a list of the indices from which you want to remove elements (in your case, where compliance is True). Then choose randomly (without replacement) as many elements from that list as you would like removed. Finally, remove them from the DataFrame.

to_remove = np.random.choice(data[data['compliance']==True].index,size=1068,replace=False)
data = data.drop(to_remove)  # drop returns a new DataFrame, so assign the result
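As a rough sketch of the same NumPy approach on a toy frame, assuming a lowercase compliance column as in the question: the seed value and the use of np.random.default_rng are my own choices for reproducibility, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real dataset.
data = pd.DataFrame({
    "compliance": [True, False, True, True],
    "day0": [1, 7, 4, 5],
})

rng = np.random.default_rng(42)  # arbitrary seed, only so the run is repeatable

# Index labels of the rows that are eligible for removal.
compliant_idx = data[data["compliance"]].index.to_numpy()

# replace=False guarantees that each label is chosen at most once.
to_remove = rng.choice(compliant_idx, size=2, replace=False)

# drop() returns a new DataFrame, so the result has to be assigned.
data = data.drop(to_remove)
print(data)
```

Because the chosen values are index labels rather than positions, this keeps working after earlier rows have been removed.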

Upvotes: 2
