Reputation: 23
Imagine a dataframe, such as:
COL1 COL2 COL3
foo unique string unique int
foo unique string unique int
foo unique string unique int
foo unique string unique int
foo unique string unique int
bar unique string unique int
bar unique string unique int
bar unique string unique int
bar unique string unique int
qux unique string unique int
qux unique string unique int
biz unique string unique int
I'd like to set a rule where I only keep a max of 3 (random/first) rows for repeating items in COL1. I don't really care which row I drop, so it can be random, so it's a conscious choice.
I'm unsure how best to solve this elegantly, without having to loop over my df. I've looked into groupby, sort & drop.duplicates but I'm not too sure this is the way to go.
My best guess atm, is to sort by COL1, add COL4 where I count items by COL1(group), and then drop all rows where COL4 exceeds 3. But I'm not sure how to count items in a 4th col... Is this the way to go? Any suggestions?
Expected Output:
COL1 COL2 COL3
foo unique string unique int
foo unique string unique int
foo unique string unique int
bar unique string unique int
bar unique string unique int
bar unique string unique int
qux unique string unique int
qux unique string unique int
biz unique string unique int
Upvotes: 1
Views: 41
Reputation: 20659
You can use GroupBy.head
here to take 1st three value from a group or GroupBy.tail
to take last three.
df.groupby('COL1').head(3)
COL1 COL2 COL3
0 foo unique string unique int
1 foo unique string unique int
2 foo unique string unique int
5 bar unique string unique int
6 bar unique string unique int
7 bar unique string unique int
9 qux unique string unique int
10 qux unique string unique int
11 biz unique string unique int
Upvotes: 1