Reputation: 877
This is my DataFrame:
dt value
2000-01-01 00:00:00 296.0
2000-01-01 00:05:00 296.0
2000-01-01 00:10:00 296.0
2000-01-01 00:15:00 296.25
2000-01-01 00:20:00 296.25
2000-01-01 00:25:00 296.25
2000-01-01 00:30:00 296.25
2000-01-01 00:35:00 296.25
2000-01-01 00:40:00 296.25
2000-01-01 00:45:00 296.5
2000-01-01 00:50:00 296.5
2000-01-01 00:55:00 296.5
2000-01-01 01:00:00 296.5
2000-01-01 01:05:00 296.5
2000-01-01 01:10:00 296.5
2000-01-01 01:15:00 296.75
2000-01-01 01:20:00 296.75
2000-01-01 01:50:00 297.0
2000-01-01 01:55:00 297.0
2000-01-01 02:00:00 297.0
2000-01-01 02:05:00 297.0
2000-01-01 02:10:00 297.0
2000-01-01 02:15:00 297.0
I want to remove adjacent duplicates, keeping only the middle element of each run of equal values. If the number of duplicates in a run is even, take the one just after the middle; if there are 2 duplicates, take the second.
Expected output:
dt value
2000-01-01 00:05:00 296.0
2000-01-01 00:30:00 296.25
2000-01-01 01:00:00 296.5
2000-01-01 01:20:00 296.75
2000-01-01 02:05:00 297.0
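Put differently, for a run of n equal adjacent values I want to keep the element at zero-based position n // 2, which matches the expected output above (run of 3 -> second element, run of 6 -> fourth). A quick check of that reading:

# which zero-based position survives in a run of n adjacent duplicates
for n in (2, 3, 4, 6):
    print(n, '->', n // 2)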
I read this post about duplicates, but it doesn't satisfy my requirement of choosing the middle element:
Pandas: Drop consecutive duplicates
Upvotes: 4
Views: 228
Reputation: 59579
To avoid the slower groupby + apply(lambda ...), we can instead use the built-in methods implemented in Cython: form groups of consecutive values by checking against the shifted Series, then aggregate those groups with size and cumcount to subset the original DataFrame. (Bonus: this preserves the original DataFrame's index.)
# Label each run of consecutive equal values
gp = df['value'].ne(df['value'].shift()).cumsum()
gp = gp.groupby(gp)
# Within each run, keep the row at position size // 2
df[gp.transform('size')//2 == gp.cumcount()]
dt value
1 2000-01-01 00:05:00 296.00
6 2000-01-01 00:30:00 296.25
12 2000-01-01 01:00:00 296.50
16 2000-01-01 01:20:00 296.75
20 2000-01-01 02:05:00 297.00
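For intuition, here is a small sketch of the intermediate pieces on the sample data (the names runs, g and the printed columns are only illustrative): each run of equal values gets a label, every row knows the length of its run and its zero-based position within it, and the mask size // 2 == pos picks exactly the middle (or just-after-middle) row.

import pandas as pd

# df is the question's DataFrame
runs = df['value'].ne(df['value'].shift()).cumsum()   # run label per row
g = runs.groupby(runs)
print(pd.DataFrame({'value': df['value'],
                    'run': runs,                  # which run the row belongs to
                    'size': g.transform('size'),  # length of that run
                    'pos': g.cumcount()}))        # zero-based position within the run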
To illustrate the relative performance, here's an example. I create a test DataFrame where each value is repeated twice (just for simplicity), so we can see how the two methods scale as a function of the number of groups (which is the important quantity for the scaling of groupby + apply).
import perfplot
import pandas as pd
import numpy as np

def groupby_apply(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    return df.groupby(gp).apply(lambda x: x.iloc[len(x) // 2])

def groupby_builtin(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    gp = gp.groupby(gp)
    return df[gp.transform('size')//2 == gp.cumcount()]

perfplot.show(
    setup=lambda N: pd.DataFrame({'dt': range(N),
                                  'value': np.repeat(np.arange(N//2), 2)}),
    kernels=[
        lambda df: groupby_apply(df),
        lambda df: groupby_builtin(df)],
    labels=['apply', 'built-in'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,
    xlabel='Proxy for # of Groups'
)
Upvotes: 4
Reputation: 9081
Use -
df.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x)/2)])
Output
value
296.00 2000-01-01 00:05:00
296.25 2000-01-01 00:30:00
296.50 2000-01-01 01:00:00
296.75 2000-01-01 01:20:00
297.00 2000-01-01 02:05:00
Name: dt, dtype: object
Needless to say, you have to import math for this to work.
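If you want the result as a plain DataFrame with dt and value columns, like the expected output, one way (a small follow-up sketch building on the snippet above) is to reset the index:

import math

mid = df.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x) / 2)])
mid = mid.reset_index()[['dt', 'value']]   # Series -> DataFrame with dt and value columns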
Timings
@ALollz's answer:
3.82 ms ± 442 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer:
2.09 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 6
Reputation: 9941
If you have blocks of values that may be repeated, and you want to get the middle of each block:
# Label each block of consecutive equal values
df['g'] = (df['value'] != df['value'].shift(1)).cumsum()
# Take the middle (or just-after-middle) row of each block
df.groupby('g').apply(lambda x: x.iloc[len(x) // 2])
Output:
dt value g
g
1 2000-01-01 00:05:00 296.00 1
2 2000-01-01 00:30:00 296.25 2
3 2000-01-01 01:00:00 296.50 3
4 2000-01-01 01:20:00 296.75 4
5 2000-01-01 02:05:00 297.00 5
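If the helper column g and the group index are not wanted in the final result, they can be dropped afterwards, e.g.:

# optional cleanup to match the shape of the expected output
out = df.groupby('g').apply(lambda x: x.iloc[len(x) // 2])
out = out.drop(columns='g').reset_index(drop=True)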
Upvotes: 2