derchambers
derchambers

Reputation: 954

Pandas to form clusters based on diff column

I am trying to use Pandas to eliminate some near duplicates in a data frame based on the difference in a column representing time in seconds. For example:

import pandas as pd, numpy as np
df=pd.DataFrame([1200,1201,1233,1555,1650,5561,5562],columns=['Time'])
df['Dif']=df.Time.diff()
df['Coef']=np.random.rand(len(df))

enter image description here

so what I need to do is examine each group that has time values that occur within 2 seconds of each other, choose the one with the highest value in Coef and discard the rest. So in this example I would somehow group index 0 and 1 together and discard index 0 (because df.Coef[0] < df.Coef1 ).

Likewise, index 5,6, and 7 would be grouped togther and all but index 6 discarded. so the desired output would be df.drop([0,5,7]):

enter image description here

I currently have a python while loop algorithm to do this but the data frame can contain millions of indicies so it is much too slow. Any pure pandas solution would be much appreciated

Upvotes: 4

Views: 1587

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375445

You could do a groupby here, by enumerating the groups:

In [11]: (df['Time'].diff() > 2).cumsum()
Out[11]:
0    0
1    0
2    1
3    2
4    3
5    4
6    4
Name: Time, dtype: int64

Note: if this was a datetime column rather than 2 you'd want to compare to a timedelta.

In [12]: g = df.groupby((df.Time.diff() > 2).cumsum())

Now you can use the idxmax (the index with maximal element) for the Coeff column on each group:

In [13]: g.Coef.idxmax()
Out[13]:
Time
0       1
1       2
2       3
3       4
4       5
Name: Coef, dtype: int64

and select these rows:

In [14]: df.loc[g.Coef.idxmax()]  # results will vary since we've used a random df
Out[14]:
   Time   Dif      Coef
1  1201     1  0.760751
2  1233    32  0.501199
3  1555   322  0.473628
4  1650    95  0.371059
5  5561  3911  0.917556

Upvotes: 6

Related Questions