Pilik
Pilik

Reputation: 709

Time between events (pandas)

I want to find the time elapsed between 2 events A and B. More specifically, whenever event A occurs, I want to know how long it takes before the next occurence of event B.

Take a look at this example:

import pandas as pd
import numpy as np

np.random.seed(5)
strings=list('AB')
data=[strings[i] for i in np.random.randint(0,2,15)]
index=pd.date_range('2/2/2012',periods=15,freq='T')
dfm=pd.DataFrame(data={'event':data},index=index)
dfm

                  event
2012-02-02 00:00:00 B
2012-02-02 00:01:00 A
2012-02-02 00:02:00 B
2012-02-02 00:03:00 B
2012-02-02 00:04:00 A
2012-02-02 00:05:00 A
2012-02-02 00:06:00 A
2012-02-02 00:07:00 B
2012-02-02 00:08:00 A
2012-02-02 00:09:00 A
2012-02-02 00:10:00 B
2012-02-02 00:11:00 A
2012-02-02 00:12:00 B
2012-02-02 00:13:00 A
2012-02-02 00:14:00 A

My expected output is:

begin           end
2012-02-02 00:01:00 2012-02-02 00:02:00 
2012-02-02 00:04:00 2012-02-02 00:07:00
2012-02-02 00:05:00 2012-02-02 00:07:00
2012-02-02 00:06:00 2012-02-02 00:07:00
2012-02-02 00:08:00 2012-02-02 00:10:00
2012-02-02 00:09:00 2012-02-02 00:10:00
2012-02-02 00:11:00 2012-02-02 00:12:00

My question is related to this one but is more complex since multiple A events can occur before a B event. I've thought quite a bit about a proper solution but haven't managed to come up with something that does the trick.

Upvotes: 3

Views: 1465

Answers (1)

unutbu
unutbu

Reputation: 879561

You could use searchsorted to find the indices where the start dates would be inserted into an array of end dates while maintaining the end dates in sorted order. This array of indices tells you which end date to associate with each start date.

import pandas as pd
import numpy as np

np.random.seed(5)
strings = list('AB')
data = [strings[i] for i in np.random.randint(0, 2, 15)]
index = pd.date_range('2/2/2012', periods=15, freq='T')
dfm = pd.DataFrame(data={'event': data}, index=index)


begin = dfm.loc[dfm['event'] == 'A'].index
cutoffs = dfm.loc[dfm['event'] == 'B'].index

idx = cutoffs.searchsorted(begin)
mask = idx < len(cutoffs)
idx = idx[mask]
begin = begin[mask]
end = cutoffs[idx]

result = pd.DataFrame({'begin':begin, 'end':end})

yields

                begin                 end
0 2012-02-02 00:01:00 2012-02-02 00:02:00
1 2012-02-02 00:04:00 2012-02-02 00:07:00
2 2012-02-02 00:05:00 2012-02-02 00:07:00
3 2012-02-02 00:06:00 2012-02-02 00:07:00
4 2012-02-02 00:08:00 2012-02-02 00:10:00
5 2012-02-02 00:09:00 2012-02-02 00:10:00
6 2012-02-02 00:11:00 2012-02-02 00:12:00

The DatetimeIndex.searchsorted method is not heavily documented, but it is mainly a thin wrapper around a call to numpy.searchsorted which can handle dates represented as numpy datetime64s.

Upvotes: 5

Related Questions