Reputation: 514
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
                   "value": [20, 17, 15, 10, 8, 18, 18, 17, 13, 10]})
Notice that the dataframe is sorted by user_id and, within each user_id, by value in descending order.
For each user_id, I would like to remove the 2nd and 4th rows, so that the output looks like:
df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'b', 'b', 'b'],
                   "value": [20, 15, 8, 18, 17, 10]})
Inspired by "drop first and last row from within each group", I tried the following:
def drop_rows(dataframe):
    pos = [1, 3]
    return dataframe.drop(dataframe.index[pos], inplace=True)

df.groupby('user_id').apply(drop_rows)
But I got this error: "index 2 is out of bounds for axis 0 with size 0"
Could someone explain why this doesn't work and how I should proceed instead? Also, since the dataset is very large, an efficient solution would be helpful. Thanks a lot.
Upvotes: 3
Views: 755
Reputation: 75080
You can use groupby + cumcount to get each row's position within its group, then keep only the rows whose position is not in the to_del list:
to_del = [2,4]
df[~df.groupby('user_id').cumcount().add(1).isin(to_del)]
  user_id  value
0       a     20
2       a     15
4       a      8
5       b     18
7       b     17
9       b     10
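For completeness, here is a self-contained sketch of the same idea with the intermediate step spelled out, plus a repaired version of the function from the question for comparison. The names pos_in_group, result, and via_concat are just illustrative, and the repaired drop_rows assumes every group has at least 4 rows. As for the original attempt, one likely culprit is inplace=True: drop then returns None and modifies the group in place, so apply has nothing useful to combine. I have not traced the exact source of the "out of bounds" message, so treat that part as a guess.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a"] * 5 + ["b"] * 5,
    "value": [20, 17, 15, 10, 8, 18, 18, 17, 13, 10],
})

to_del = [2, 4]  # 1-based positions to drop within each user_id

# cumcount() numbers the rows of each group 0, 1, 2, ...;
# add(1) shifts that to 1-based so it lines up with to_del.
pos_in_group = df.groupby("user_id").cumcount().add(1)

# Boolean mask: keep rows whose in-group position is NOT in to_del.
result = df[~pos_in_group.isin(to_del)]


def drop_rows(group, pos=(1, 3)):
    """Repaired version of the question's helper (0-based positions).

    Returns a new frame instead of using inplace=True.
    Assumes the group has more rows than the largest position in pos.
    """
    return group.drop(group.index[list(pos)])


# Per-group variant, shown only for comparison: it loops over groups in
# Python, so the vectorized mask above should scale better on large data.
via_concat = pd.concat(drop_rows(g) for _, g in df.groupby("user_id"))
```

Both result and via_concat keep the original index labels; call reset_index(drop=True) on either if you want a fresh 0..n-1 index.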
Upvotes: 4