Reputation: 566
I have multiple lists of time intervals and I need to find the time intervals (intersection) that are common to all of them.
E.g.
a = [['2018-02-03 15:06:30', '2018-02-03 17:06:30'], # each line is read as [start, end]
['2018-02-05 10:30:30', '2018-02-05 10:36:30'],
['2018-02-05 11:30:30', '2018-02-05 11:42:32']]
b = [['2018-02-03 15:16:30', '2018-02-03 18:06:30'],
['2018-02-04 10:30:30', '2018-02-05 10:32:30']]
c = [['2018-02-01 15:00:30', '2018-02-05 18:06:30']]
The result would be
common_intv = [['2018-02-03 15:16:30','2018-02-03 17:06:30'],
['2018-02-05 10:30:30','2018-02-05 10:32:30']]
I've found this solution that should work also for time intervals but I was wondering whether there is a more efficient way to do it in pandas.
The proposed solution in the link would process two lists at a time i.e. it would first find the common intervals between a
and b
, then put these common intervals inside a variable common
, then find the common intervals between common
and c
and so on...
Of course a global solution (considering all intervals at the same time) would be even better!
Upvotes: 1
Views: 334
Reputation: 148880
You can use pandas.merge_asof
in both directions to get a first selection and then carefully cleanup the resulting rows. Code could be:
# build the dataframes and ensure Timestamp types
dfa = pd.DataFrame(a, columns=['start', 'end']).astype('datetime64[ns]')
dfb = pd.DataFrame(b, columns=['start', 'end']).astype('datetime64[ns]')
dfc = pd.DataFrame(c, columns=['start', 'end']).astype('datetime64[ns]')
# merge a and b
tmp = pd.concat([pd.merge_asof(dfa, dfb, on='start'),
pd.merge_asof(dfb, dfa, on='start')]
).sort_values('start').dropna()
# keep the minimum end and ensure end <= start
tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]
# merge c
tmp = pd.concat([pd.merge_asof(tmp, dfc, on='start'),
pd.merge_asof(dfc, tmp, on='start')]
).sort_values('start').dropna()
tmp = tmp.assign(end=np.minimum(tmp.end_x, tmp.end_y))[['start', 'end']]
tmp = tmp[tmp['start'] <= tmp['end']]
It gives as expected:
start end
0 2018-02-03 15:16:30 2018-02-03 17:06:30
1 2018-02-05 10:30:30 2018-02-05 10:32:30
Upvotes: 2