LucaVJ
LucaVJ

Reputation: 23

How to selectively drop x number of rows for repeating entries in a dataframe using python?

Imagine a dataframe, such as:

COL1             COL2                   COL3
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
bar              unique string          unique int             
bar              unique string          unique int
bar              unique string          unique int
bar              unique string          unique int
qux              unique string          unique int
qux              unique string          unique int
biz              unique string          unique int

I'd like to set a rule where I only keep a max of 3 (random/first) rows for repeating items in COL1. I don't really care which row I drop, so it can be random, so it's a conscious choice.

I'm unsure how best to solve this elegantly, without having to loop over my df. I've looked into groupby, sort & drop.duplicates but I'm not too sure this is the way to go.

My best guess atm, is to sort by COL1, add COL4 where I count items by COL1(group), and then drop all rows where COL4 exceeds 3. But I'm not sure how to count items in a 4th col... Is this the way to go? Any suggestions?

Expected Output:

COL1             COL2                   COL3
foo              unique string          unique int
foo              unique string          unique int
foo              unique string          unique int
bar              unique string          unique int             
bar              unique string          unique int
bar              unique string          unique int
qux              unique string          unique int
qux              unique string          unique int
biz              unique string          unique int

Upvotes: 1

Views: 41

Answers (1)

Ch3steR
Ch3steR

Reputation: 20659

You can use GroupBy.head here to take 1st three value from a group or GroupBy.tail to take last three.

df.groupby('COL1').head(3)

   COL1           COL2        COL3
0   foo  unique string  unique int
1   foo  unique string  unique int
2   foo  unique string  unique int
5   bar  unique string  unique int
6   bar  unique string  unique int
7   bar  unique string  unique int
9   qux  unique string  unique int
10  qux  unique string  unique int
11  biz  unique string  unique int

Upvotes: 1

Related Questions