Reputation: 23
Building off the question/solution here, I'm trying to set a parameter that will only remove consecutive duplicates if the same value occurs 5 (or more) times consecutively...
I'm able to apply the solution in the linked post which uses .shift()
to check if the previous (or a specified value in the past or future by adjusting the shift periods parameter) equals the current value, but how could I adjust this to check several consecutive values simultaneously?
Suppose a dataframe that looks like this:
x y
1 2
2 2
3 3
4 3
5 3
6 3
7 3
8 4
9 4
10 4
11 4
12 2
I'm trying to achieve this:
x y
1 2
2 2
3 3
8 4
9 4
10 4
11 4
12 2
Where we lose rows 4,5,6,7 because we found five consecutive 3's in the y column. But keep rows 1,2 because it we only find two consecutive 2's in the y column. Similarly, keep rows 8,9,10,11 because we only find four consecutive 4's in the y column.
Upvotes: 2
Views: 1829
Reputation: 26676
Not straight forward, I would go with @Quang Hoang
Create a column which gives the number of times a values is duplicated. In this case I used np.where()
and df.duplicated()
and assigned any count> 4
to be NaN
df['g']=np.where(df.groupby('y').transform(lambda x: x.duplicated(keep='last').count())>4, np.nan,1)
I then create two dataframes. One where I drop all the NaNs
and one with only NaNs
. In the one with NaNs
, I drop all apart from the last index using .last_valid_index()
. I then append them and sort by index using .sort_index()
. I use iloc[:,:2])
to slice out new column I created above
df.dropna().append(df.loc[df[df.g.isna()].last_valid_index()]).sort_index().iloc[:,:2]
x y
0 1.0 2.0
1 2.0 2.0
6 7.0 3.0
7 8.0 4.0
8 9.0 4.0
9 10.0 4.0
10 11.0 4.0
11 12.0 2.0
Upvotes: 0
Reputation: 150785
Let's try cumsum
on the differences to find the consecutive blocks. Then groupby().transform('size')
to get the size of the blocks:
thresh = 5
s = df['y'].diff().ne(0).cumsum()
small_size = s.groupby(s).transform('size') < thresh
first_rows = ~s.duplicated()
df[small_size | first_rows]
Output:
x y
0 1 2
1 2 2
2 3 3
7 8 4
8 9 4
9 10 4
10 11 4
11 12 2
Upvotes: 2