Reputation: 709
I want to find the time elapsed between 2 events A and B. More specifically, whenever event A occurs, I want to know how long it takes before the next occurence of event B.
Take a look at this example:
import pandas as pd
import numpy as np
np.random.seed(5)
strings=list('AB')
data=[strings[i] for i in np.random.randint(0,2,15)]
index=pd.date_range('2/2/2012',periods=15,freq='T')
dfm=pd.DataFrame(data={'event':data},index=index)
dfm
event
2012-02-02 00:00:00 B
2012-02-02 00:01:00 A
2012-02-02 00:02:00 B
2012-02-02 00:03:00 B
2012-02-02 00:04:00 A
2012-02-02 00:05:00 A
2012-02-02 00:06:00 A
2012-02-02 00:07:00 B
2012-02-02 00:08:00 A
2012-02-02 00:09:00 A
2012-02-02 00:10:00 B
2012-02-02 00:11:00 A
2012-02-02 00:12:00 B
2012-02-02 00:13:00 A
2012-02-02 00:14:00 A
My expected output is:
begin end
2012-02-02 00:01:00 2012-02-02 00:02:00
2012-02-02 00:04:00 2012-02-02 00:07:00
2012-02-02 00:05:00 2012-02-02 00:07:00
2012-02-02 00:06:00 2012-02-02 00:07:00
2012-02-02 00:08:00 2012-02-02 00:10:00
2012-02-02 00:09:00 2012-02-02 00:10:00
2012-02-02 00:11:00 2012-02-02 00:12:00
My question is related to this one but is more complex since multiple A events can occur before a B event. I've thought quite a bit about a proper solution but haven't managed to come up with something that does the trick.
Upvotes: 3
Views: 1465
Reputation: 879561
You could use searchsorted
to find the indices where the start dates would be inserted into an array of end dates
while maintaining the end dates in sorted order. This array of indices tells you which end date to associate with each start date.
import pandas as pd
import numpy as np
np.random.seed(5)
strings = list('AB')
data = [strings[i] for i in np.random.randint(0, 2, 15)]
index = pd.date_range('2/2/2012', periods=15, freq='T')
dfm = pd.DataFrame(data={'event': data}, index=index)
begin = dfm.loc[dfm['event'] == 'A'].index
cutoffs = dfm.loc[dfm['event'] == 'B'].index
idx = cutoffs.searchsorted(begin)
mask = idx < len(cutoffs)
idx = idx[mask]
begin = begin[mask]
end = cutoffs[idx]
result = pd.DataFrame({'begin':begin, 'end':end})
yields
begin end
0 2012-02-02 00:01:00 2012-02-02 00:02:00
1 2012-02-02 00:04:00 2012-02-02 00:07:00
2 2012-02-02 00:05:00 2012-02-02 00:07:00
3 2012-02-02 00:06:00 2012-02-02 00:07:00
4 2012-02-02 00:08:00 2012-02-02 00:10:00
5 2012-02-02 00:09:00 2012-02-02 00:10:00
6 2012-02-02 00:11:00 2012-02-02 00:12:00
The DatetimeIndex.searchsorted
method
is not heavily documented, but it is mainly a thin wrapper around a call to numpy.searchsorted
which can handle dates represented as numpy datetime64
s.
Upvotes: 5