Drop consecutive duplicates in Pandas dataframe if repeated more than n times

Question

Building off the question/solution here, I'm trying to set a parameter that will only remove consecutive duplicates if the same value occurs 5 (or more) times consecutively...

I'm able to apply the solution in the linked post which uses .shift() to check if the previous (or a specified value in the past or future by adjusting the shift periods parameter) equals the current value, but how could I adjust this to check several consecutive values simultaneously?

Suppose a dataframe that looks like this:

I'm trying to achieve this:

Where we lose rows 4,5,6,7 because we found five consecutive 3's in the y column. But keep rows 1,2 because it we only find two consecutive 2's in the y column. Similarly, keep rows 8,9,10,11 because we only find four consecutive 4's in the y column.

Quang Hoang · Accepted Answer

Let's try cumsum on the differences to find the consecutive blocks. Then groupby().transform('size') to get the size of the blocks:

thresh = 5
s = df['y'].diff().ne(0).cumsum()

small_size = s.groupby(s).transform('size') < thresh
first_rows = ~s.duplicated()

df[small_size | first_rows]

Output:

Drop consecutive duplicates in Pandas dataframe if repeated more than n times

Answers (2)

Related Questions