Reputation: 954
I am trying to use Pandas to eliminate some near duplicates in a data frame based on the difference in a column representing time in seconds. For example:
import pandas as pd, numpy as np
df=pd.DataFrame([1200,1201,1233,1555,1650,5561,5562],columns=['Time'])
df['Dif']=df.Time.diff()
df['Coef']=np.random.rand(len(df))
so what I need to do is examine each group that has time values that occur within 2 seconds of each other, choose the one with the highest value in Coef and discard the rest. So in this example I would somehow group index 0 and 1 together and discard index 0 (because df.Coef[0] < df.Coef1 ).
Likewise, index 5,6, and 7 would be grouped togther and all but index 6 discarded. so the desired output would be df.drop([0,5,7]):
I currently have a python while loop algorithm to do this but the data frame can contain millions of indicies so it is much too slow. Any pure pandas solution would be much appreciated
Upvotes: 4
Views: 1587
Reputation: 375445
You could do a groupby here, by enumerating the groups:
In [11]: (df['Time'].diff() > 2).cumsum()
Out[11]:
0 0
1 0
2 1
3 2
4 3
5 4
6 4
Name: Time, dtype: int64
Note: if this was a datetime column rather than 2 you'd want to compare to a timedelta.
In [12]: g = df.groupby((df.Time.diff() > 2).cumsum())
Now you can use the idxmax (the index with maximal element) for the Coeff column on each group:
In [13]: g.Coef.idxmax()
Out[13]:
Time
0 1
1 2
2 3
3 4
4 5
Name: Coef, dtype: int64
and select these rows:
In [14]: df.loc[g.Coef.idxmax()] # results will vary since we've used a random df
Out[14]:
Time Dif Coef
1 1201 1 0.760751
2 1233 32 0.501199
3 1555 322 0.473628
4 1650 95 0.371059
5 5561 3911 0.917556
Upvotes: 6