Reputation: 519
How do I drop a limited number of rows? So far my code drops every instance I give, so in the example below every instance of 'dog' is dropped. However, I would like to drop only a specified number of instances, for example only 2 instances of 'dog'. It would also be a benefit if the instances to drop were sampled at random.
import pandas as pd

num = [10, 20, 30, 10, 40, 50, 20, 60, 70, 20]
color = ['red', 'white', 'black', 'green', 'white', 'orange', 'white', 'black', 'blue', 'red']
animal = ['dog', 'cat', 'raccoon', 'gecko', 'bear', 'raccoon', 'dog', 'goat', 'goat', 'dog']
data = {'Number': num, 'Color': color, 'Animal': animal}  # named data to avoid shadowing the built-in dict
df = pd.DataFrame(data)

# Drops every row whose Animal is listed in to_drop
to_drop = ['dog']
trimmed_df = df[~df['Animal'].isin(to_drop)]
Upvotes: 2
Views: 82
Reputation: 59519
If you have multiple animals and want to drop a different amount of each, you can use groupby + sample. Store the animals and amounts in a dict, then sample the number of rows to keep from each group.
This drops rows at random, and if you specify an N greater than the number of rows for an animal, all of that animal's rows are dropped.
to_drop = {'dog': 2, 'raccoon': 1}
l = []
for animal, gp in df.groupby('Animal'):
    l.append(gp.sample(n=max(0, len(gp)-to_drop.get(animal, 0)), replace=False))
pd.concat(l).sort_index()
   Number   Color   Animal
1      20   white      cat
3      10   green    gecko
4      40   white     bear
5      50  orange  raccoon
7      60   black     goat
8      70    blue     goat
9      20     red      dog
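As a rough check of the claim that an N larger than the group size drops the whole group, the loop above can be wrapped in a small function. This is only a sketch: the name drop_random_per_group and its seed parameter are hypothetical, added here so the result is reproducible, and the sample data is copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Number': [10, 20, 30, 10, 40, 50, 20, 60, 70, 20],
    'Color': ['red', 'white', 'black', 'green', 'white',
              'orange', 'white', 'black', 'blue', 'red'],
    'Animal': ['dog', 'cat', 'raccoon', 'gecko', 'bear',
               'raccoon', 'dog', 'goat', 'goat', 'dog'],
})

def drop_random_per_group(frame, column, to_drop, seed=None):
    """Hypothetical helper: randomly drop up to to_drop[label] rows per label."""
    kept = []
    for label, gp in frame.groupby(column):
        # Number of rows to keep; clamps to 0 when N exceeds the group size.
        n_keep = max(0, len(gp) - to_drop.get(label, 0))
        kept.append(gp.sample(n=n_keep, replace=False, random_state=seed))
    # Restore the original row order.
    return pd.concat(kept).sort_index()

# There are only 3 dogs, so asking for 5 removes all of them.
trimmed = drop_random_per_group(df, 'Animal', {'dog': 5, 'raccoon': 1}, seed=0)
counts = trimmed['Animal'].value_counts()
```

Labels that are missing from the dict fall back to 0 through to_drop.get, so those groups are kept in full.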
The above isn't very efficient. Leveraging @QuangHoang's clever idea of using cumcount, we first shuffle the entire DataFrame (.sample(frac=1)) so that the rows to drop are chosen at random, and then compare the cumcount with the cut-offs.
to_drop = {'dog': 2, 'raccoon': 1}
m = (df.sample(frac=1).groupby('Animal').cumcount()
       .lt(df['Animal'].map(to_drop)))
df = df[~m]
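The same idea can be spelled out step by step. This is a minimal sketch on the question's sample data; the random_state value, the explicit reindex, and the intermediate names (order, cutoff, mask, trimmed) are assumptions added here for reproducibility and readability.

```python
import pandas as pd

df = pd.DataFrame({
    'Number': [10, 20, 30, 10, 40, 50, 20, 60, 70, 20],
    'Color': ['red', 'white', 'black', 'green', 'white',
              'orange', 'white', 'black', 'blue', 'red'],
    'Animal': ['dog', 'cat', 'raccoon', 'gecko', 'bear',
               'raccoon', 'dog', 'goat', 'goat', 'dog'],
})
to_drop = {'dog': 2, 'raccoon': 1}

# Shuffle the rows, then number each row within its animal group (0, 1, 2, ...).
order = df.sample(frac=1, random_state=42).groupby('Animal').cumcount()

# Per-row cut-off; animals missing from to_drop map to NaN,
# and any comparison with NaN is False, so those rows are kept.
cutoff = df['Animal'].map(to_drop)

# Put the shuffled counter back in df's row order before comparing.
mask = order.reindex(df.index).lt(cutoff)

trimmed = df[~mask]
counts = trimmed['Animal'].value_counts()
```

Because the counter is assigned after shuffling, which rows receive the low numbers (and are therefore dropped) is random, while the number dropped per animal is fixed by to_drop.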
Upvotes: 1
Reputation: 150725
You can try:
to_drop = ['dog']
s = df['Animal'].isin(to_drop)
mask = s & s.cumsum().le(2)
df[~mask]
Output:
   Number   Color   Animal
1      20   white      cat
2      30   black  raccoon
3      10   green    gecko
4      40   white     bear
5      50  orange  raccoon
7      60   black     goat
8      70    blue     goat
9      20     red      dog
Update: If to_drop has multiple labels and you want to drop 2 instances of each label in to_drop, you can use groupby().cumcount():
mask = (df['Animal'].isin(to_drop) &
        df.groupby('Animal').cumcount().lt(2)
       )
print(df[~mask])
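To illustrate why the groupby version is needed once to_drop holds several labels, the sketch below compares the two masks on the Animal column from the question. The names mask_total, mask_each, dropped_total and dropped_each are made up for this comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Animal': ['dog', 'cat', 'raccoon', 'gecko', 'bear',
               'raccoon', 'dog', 'goat', 'goat', 'dog'],
})
to_drop = ['dog', 'raccoon']
s = df['Animal'].isin(to_drop)

# cumsum counts matching rows across all labels together,
# so only the first 2 matches overall are flagged.
mask_total = s & s.cumsum().le(2)

# cumcount restarts for each label, so the first 2 rows of every label are flagged.
mask_each = s & df.groupby('Animal').cumcount().lt(2)

dropped_total = df.index[mask_total].tolist()  # [0, 2]
dropped_each = df.index[mask_each].tolist()    # [0, 2, 5, 6]
```

Both masks pick the earliest rows rather than random ones; shuffling before the cumcount is what makes the selection random.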
Upvotes: 2