Reputation: 6381
Let's assume there're two dataframes, recording two different operations (who at which time did the operation) respectively.
# operation A
df1 = pd.DataFrame([['a', 138], ['a', 140], ['b', 150]], columns=['name', 'timestamp'])
print df1
# operation B
df2 = pd.DataFrame([['a', 145], ['b', 170]], columns=['name', 'timestamp'])
print df2
--output--
name timestamp
0 a 138
1 a 140
2 b 150
name timestamp
0 a 145
1 b 170
What I intend to do is to retrieve consecutive operations A->B where B happens within 10 units against previous A happening time.
In the example above, (B, a, 145) and (A, a, 140) will be a qualified consecutive operations, yet (B, b, 170) and (A, b, 150) will not be since 170-150=20>10 units.
Could anyone give me the most efficient way to processing this model is the data is relatively large (10,000-100,000)? Many thanks.
Upvotes: 1
Views: 61
Reputation: 1719
First you would create a new column that is the difference between consecutive timestamps
df1.loc[:, 'time_diff'] = df1.timestamp.diff()
df1
name timestamp time_diff
0 a 138 NaN
1 a 140 2
2 b 150 10
df2.loc[:, 'time_diff'] = df2.timestamp.diff()
then you want to see when name
transitions. you can do this using duplicated
. You also want to ignore the first row (so when time_diff is null). This will tell you the rows where a transition happened, and you can examine the time_diff column however you want
df1.loc[~df1.name.duplicated() & df1.time_diff.notnull(),:]
df1
name timestamp time_diff
2 b 150 10
vs.
df2.loc[~df2.name.duplicated() & df2.time_diff.notnull(),:]
df2
name timestamp time_diff
1 b 170 25
so if you have tons of rows, you can have some logic like
df.loc[~df.name.duplicated() & df.time_diff.notnull() & (df.time_diff <= 10)]
Upvotes: 1