sacuL

Reputation: 51425

Efficiently find overlap of date-time ranges from 2 dataframes

There are some questions out there regarding finding the overlap in date or time ranges (for example). I've used these to solve my problem, but ended up with an extremely slow (and not-at-all elegant) solution. I would really appreciate it if someone has an idea of how to make this faster (and more elegant):

The Problem:

I've got 2 dataframes, df1 and df2, each with 2 columns that represent a start time and an end time:

>>> df1

        datetime_start        datetime_end
0  2016-09-11 06:00:00 2016-09-11 06:30:00
1  2016-09-11 07:00:00 2016-09-11 07:30:00
2  2016-09-11 07:30:00 2016-09-11 08:00:00
3  2016-09-11 08:00:00 2016-09-11 08:30:00
4  2016-09-11 08:30:00 2016-09-11 09:00:00
5  2016-09-11 09:00:00 2016-09-11 09:30:00
6  2016-09-11 09:30:00 2016-09-11 10:00:00
7  2016-09-11 10:30:00 2016-09-11 11:00:00
13 2016-09-11 14:00:00 2016-09-11 14:30:00
14 2016-09-11 14:30:00 2016-09-11 15:00:00
15 2016-09-11 15:00:00 2016-09-11 15:30:00
16 2016-09-11 15:30:00 2016-09-11 16:00:00
17 2016-09-11 16:00:00 2016-09-11 16:30:00
18 2016-09-11 16:30:00 2016-09-11 17:00:00
19 2016-09-11 17:00:00 2016-09-11 17:30:00

>>> df2

        datetime_start        datetime_end catg
4  2016-09-11 08:48:33 2016-09-11 09:41:53    a
6  2016-09-11 09:54:25 2016-09-11 10:00:50    a
8  2016-09-11 10:01:47 2016-09-11 10:04:55    b
10 2016-09-11 10:08:00 2016-09-11 10:08:11    b
12 2016-09-11 10:30:28 2016-09-11 10:30:28    b
14 2016-09-11 10:38:18 2016-09-11 10:38:18    a
18 2016-09-11 13:44:05 2016-09-11 13:44:05    a
20 2016-09-11 13:46:52 2016-09-11 14:11:41    d
23 2016-09-11 14:22:17 2016-09-11 14:33:40    b
25 2016-09-11 15:00:12 2016-09-11 15:02:55    b
27 2016-09-11 15:04:19 2016-09-11 15:06:36    b
29 2016-09-11 15:08:43 2016-09-11 15:31:29    d
31 2016-09-11 15:38:04 2016-09-11 16:09:24    a
33 2016-09-11 16:18:40 2016-09-11 16:44:32    b
35 2016-09-11 16:45:59 2016-09-11 16:59:01    b
37 2016-09-11 17:08:31 2016-09-11 17:12:23    b
39 2016-09-11 17:16:13 2016-09-11 17:16:33    c
41 2016-09-11 17:17:23 2016-09-11 17:20:00    b
45 2016-09-13 12:27:59 2016-09-13 12:34:21    a
47 2016-09-13 12:38:39 2016-09-13 12:38:45    a

What I want is to find where the ranges in df2 overlap the ranges in df1, how long each overlap is (in seconds), and which df2.catg value it belongs to. The length of each overlap should be inserted into a column of df1 named after the catg it represents.
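For reference, the overlap length of any two ranges comes down to the standard interval-intersection formula, max(0, min(ends) - max(starts)). A minimal standalone sketch (the function name is just illustrative):

```python
from datetime import datetime

def overlap_seconds(start1, end1, start2, end2):
    # intersection of two intervals: max(0, min(ends) - max(starts))
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    return max(0.0, (earliest_end - latest_start).total_seconds())

# e.g. the 08:30-09:00 bin from df1 vs. the first 'a' event in df2
secs = overlap_seconds(datetime(2016, 9, 11, 8, 30), datetime(2016, 9, 11, 9, 0),
                       datetime(2016, 9, 11, 8, 48, 33), datetime(2016, 9, 11, 9, 41, 53))
# secs == 687.0, matching row 4 of the desired output below
```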

Desired output:

>>> df1
        datetime_start        datetime_end       a       b       d     c
0  2016-09-11 06:00:00 2016-09-11 06:30:00     0.0     0.0     0.0   0.0
1  2016-09-11 07:00:00 2016-09-11 07:30:00     0.0     0.0     0.0   0.0
2  2016-09-11 07:30:00 2016-09-11 08:00:00     0.0     0.0     0.0   0.0
3  2016-09-11 08:00:00 2016-09-11 08:30:00     0.0     0.0     0.0   0.0
4  2016-09-11 08:30:00 2016-09-11 09:00:00   687.0     0.0     0.0   0.0
5  2016-09-11 09:00:00 2016-09-11 09:30:00  1800.0     0.0     0.0   0.0
6  2016-09-11 09:30:00 2016-09-11 10:00:00  1048.0     0.0     0.0   0.0
7  2016-09-11 10:30:00 2016-09-11 11:00:00     0.0     0.0     0.0   0.0
13 2016-09-11 14:00:00 2016-09-11 14:30:00     0.0   463.0   701.0   0.0
14 2016-09-11 14:30:00 2016-09-11 15:00:00     0.0   220.0     0.0   0.0
15 2016-09-11 15:00:00 2016-09-11 15:30:00     0.0   300.0  1277.0   0.0
16 2016-09-11 15:30:00 2016-09-11 16:00:00  1316.0     0.0    89.0   0.0
17 2016-09-11 16:00:00 2016-09-11 16:30:00   564.0   680.0     0.0   0.0
18 2016-09-11 16:30:00 2016-09-11 17:00:00     0.0  1654.0     0.0   0.0
19 2016-09-11 17:00:00 2016-09-11 17:30:00     0.0   389.0     0.0  20.0

The ridiculously slow way to do this:

Based on this beautiful answer, I've achieved the goals I want using the following hard-to-follow code:

from collections import namedtuple
Range = namedtuple('Range', ['start', 'end'])

def overlap(row1, row2):
    r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
    r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
    latest_start = max(r1.start, r2.start)
    earliest_end = min(r1.end, r2.end)
    delta = (earliest_end - latest_start).total_seconds()
    return max(0, delta)  # 0 seconds if the ranges don't overlap

for cat in df2.catg.unique().tolist():
    df1[cat] = 0

for idx1, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        if overlap(row1, row2) > 0:
            df1.loc[idx1, row2.catg] += overlap(row1, row2)

This works, but is soooo slow on larger dataframes that it's basically unusable. If anyone has any ideas to speed this up, I'd love your input.

Thanks in advance, and let me know if something is unclear!

dataframe setup:

import pandas as pd
from pandas import Timestamp

d1 = {'datetime_start': {0: Timestamp('2016-09-11 06:00:00'), 1: Timestamp('2016-09-11 07:00:00'), 2: Timestamp('2016-09-11 07:30:00'), 3: Timestamp('2016-09-11 08:00:00'), 4: Timestamp('2016-09-11 08:30:00'), 5: Timestamp('2016-09-11 09:00:00'), 6: Timestamp('2016-09-11 09:30:00'), 7: Timestamp('2016-09-11 10:30:00'), 13: Timestamp('2016-09-11 14:00:00'), 14: Timestamp('2016-09-11 14:30:00'), 15: Timestamp('2016-09-11 15:00:00'), 16: Timestamp('2016-09-11 15:30:00'), 17: Timestamp('2016-09-11 16:00:00'), 18: Timestamp('2016-09-11 16:30:00'), 19: Timestamp('2016-09-11 17:00:00')}, 'datetime_end': {0: Timestamp('2016-09-11 06:30:00'), 1: Timestamp('2016-09-11 07:30:00'), 2: Timestamp('2016-09-11 08:00:00'), 3: Timestamp('2016-09-11 08:30:00'), 4: Timestamp('2016-09-11 09:00:00'), 5: Timestamp('2016-09-11 09:30:00'), 6: Timestamp('2016-09-11 10:00:00'), 7: Timestamp('2016-09-11 11:00:00'), 13: Timestamp('2016-09-11 14:30:00'), 14: Timestamp('2016-09-11 15:00:00'), 15: Timestamp('2016-09-11 15:30:00'), 16: Timestamp('2016-09-11 16:00:00'), 17: Timestamp('2016-09-11 16:30:00'), 18: Timestamp('2016-09-11 17:00:00'), 19: Timestamp('2016-09-11 17:30:00')}}

d2 = {'datetime_start': {4: Timestamp('2016-09-11 08:48:33'), 6: Timestamp('2016-09-11 09:54:25'), 8: Timestamp('2016-09-11 10:01:47'), 10: Timestamp('2016-09-11 10:08:00'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 13:46:52'), 23: Timestamp('2016-09-11 14:22:17'), 25: Timestamp('2016-09-11 15:00:12'), 27: Timestamp('2016-09-11 15:04:19'), 29: Timestamp('2016-09-11 15:08:43'), 31: Timestamp('2016-09-11 15:38:04'), 33: Timestamp('2016-09-11 16:18:40'), 35: Timestamp('2016-09-11 16:45:59'), 37: Timestamp('2016-09-11 17:08:31'), 39: Timestamp('2016-09-11 17:16:13'), 41: Timestamp('2016-09-11 17:17:23'), 45: Timestamp('2016-09-13 12:27:59'), 47: Timestamp('2016-09-13 12:38:39')}, 'datetime_end': {4: Timestamp('2016-09-11 09:41:53'), 6: Timestamp('2016-09-11 10:00:50'), 8: Timestamp('2016-09-11 10:04:55'), 10: Timestamp('2016-09-11 10:08:11'), 12: Timestamp('2016-09-11 10:30:28'), 14: Timestamp('2016-09-11 10:38:18'), 18: Timestamp('2016-09-11 13:44:05'), 20: Timestamp('2016-09-11 14:11:41'), 23: Timestamp('2016-09-11 14:33:40'), 25: Timestamp('2016-09-11 15:02:55'), 27: Timestamp('2016-09-11 15:06:36'), 29: Timestamp('2016-09-11 15:31:29'), 31: Timestamp('2016-09-11 16:09:24'), 33: Timestamp('2016-09-11 16:44:32'), 35: Timestamp('2016-09-11 16:59:01'), 37: Timestamp('2016-09-11 17:12:23'), 39: Timestamp('2016-09-11 17:16:33'), 41: Timestamp('2016-09-11 17:20:00'), 45: Timestamp('2016-09-13 12:34:21'), 47: Timestamp('2016-09-13 12:38:45')}, 'catg': {4: 'a', 6: 'a', 8: 'b', 10: 'b', 12: 'b', 14: 'a', 18: 'a', 20: 'd', 23: 'b', 25: 'b', 27: 'b', 29: 'd', 31: 'a', 33: 'b', 35: 'b', 37: 'b', 39: 'c', 41: 'b', 45: 'a', 47: 'a'}}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

Upvotes: 10

Views: 6261

Answers (3)

cmaher

Reputation: 5225

Based on timeit tests, with 100 executions each, the namedtuple approach in the question averaged 15.7314 seconds on my machine, vs. an average of 1.4794 seconds with this approach:

import numpy as np

# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values

# create a numpy array with one timestamp for each second
# in which an event occurred (list() around map is needed on Python 3)
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
                np.concatenate(list(map(np.arange, duration))) * pd.Timedelta('1S')

df1.merge(pd.DataFrame({'datetime_start':seconds_range,
                        'catg':np.repeat(df2.catg, duration)}). \
              groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')]). \
              size(). \
              unstack(level=0). \
              reset_index(), 
          how="left")

#           datetime_end      datetime_start       a       b     c       d
# 0  2016-09-11 06:30:00 2016-09-11 06:00:00     NaN     NaN   NaN     NaN
# 1  2016-09-11 07:30:00 2016-09-11 07:00:00     NaN     NaN   NaN     NaN
# 2  2016-09-11 08:00:00 2016-09-11 07:30:00     NaN     NaN   NaN     NaN
# 3  2016-09-11 08:30:00 2016-09-11 08:00:00     NaN     NaN   NaN     NaN
# 4  2016-09-11 09:00:00 2016-09-11 08:30:00   687.0     NaN   NaN     NaN
# 5  2016-09-11 09:30:00 2016-09-11 09:00:00  1800.0     NaN   NaN     NaN
# 6  2016-09-11 10:00:00 2016-09-11 09:30:00  1048.0     NaN   NaN     NaN
# 7  2016-09-11 11:00:00 2016-09-11 10:30:00     NaN     NaN   NaN     NaN
# 8  2016-09-11 14:30:00 2016-09-11 14:00:00     NaN   463.0   NaN   701.0
# 9  2016-09-11 15:00:00 2016-09-11 14:30:00     NaN   220.0   NaN     NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00     NaN   300.0   NaN  1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00  1316.0     NaN   NaN    89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00   564.0   680.0   NaN     NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00     NaN  1654.0   NaN     NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00     NaN   389.0  20.0     NaN
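The heart of this approach is expanding every event into one timestamp per elapsed second, then counting timestamps per 30-minute bin. A toy sketch of that expansion (the two events here are invented for illustration):

```python
import numpy as np
import pandas as pd

starts = pd.to_datetime(["2016-09-11 08:59:58", "2016-09-11 09:00:10"]).values
ends   = pd.to_datetime(["2016-09-11 09:00:03", "2016-09-11 09:00:12"]).values
duration = ((ends - starts) / np.timedelta64(1, "s")).astype(int)   # [5, 2]

# one timestamp per elapsed second of each event
seconds = np.repeat(starts, duration) + \
          np.concatenate([np.arange(d) for d in duration]) * np.timedelta64(1, "s")

# counting timestamps per bin yields seconds of overlap per bin:
# the first event straddles the 08:30 and 09:00 bins (2 s + 3 s)
per_bin = pd.Series(1, index=pd.DatetimeIndex(seconds)).resample("30min").sum()
```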

Upvotes: 4

jpp

Reputation: 164823

You should see a significant (~8x in my testing) performance improvement through a few changes. The structure of your code remains the same:

import numpy as np

def overlap(row1, row2):
    # index 0 is datetime_end and index 1 is datetime_start: these frames
    # order their columns alphabetically, so adjust the slices if yours differ
    return max(0, (min(row1[0], row2[0]) - max(row1[1], row2[1])) / np.timedelta64(1, 's'))

df1 = df1.join(pd.DataFrame(dict.fromkeys(df2.catg.unique(), 0), index=df1.index))

for idx1, row1 in enumerate(df1.iloc[:, :2].values):
    for catg, row2 in zip(df2['catg'], df2.iloc[:, 1:3].values):
        df1.iat[idx1, df1.columns.get_loc(catg)] += overlap(row1, row2)

You can get this down further via numba, or with some clever pandas manipulation that hides all your logic.

Explanation

  1. Iterate over the underlying NumPy arrays (df.values) instead of df.iterrows
  2. Use df.iat instead of df.loc
  3. Use numpy instead of pandas time objects
  4. Remove named tuple creation
  5. Remove duplicate overlap calculation
  6. Improve overlap algorithm
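Points 2 and 3 can be illustrated in isolation; the values below are toy examples, not part of the benchmark:

```python
import numpy as np
import pandas as pd

# point 3: numpy datetime arithmetic skips Timestamp/Timedelta object overhead
t1 = np.datetime64("2016-09-11 09:00:00")
t2 = np.datetime64("2016-09-11 08:48:33")
secs = (t1 - t2) / np.timedelta64(1, "s")   # 687.0 seconds as a float

# point 2: .iat does positional scalar access, cheaper than label-based .loc
df = pd.DataFrame({"a": [0.0, 0.0]})
df.iat[0, df.columns.get_loc("a")] += secs
```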

Result

          datetime_end      datetime_start     a     b   c     d
0  2016-09-11 06:30:00 2016-09-11 06:00:00     0     0   0     0
1  2016-09-11 07:30:00 2016-09-11 07:00:00     0     0   0     0
2  2016-09-11 08:00:00 2016-09-11 07:30:00     0     0   0     0
3  2016-09-11 08:30:00 2016-09-11 08:00:00     0     0   0     0
4  2016-09-11 09:00:00 2016-09-11 08:30:00   687     0   0     0
5  2016-09-11 09:30:00 2016-09-11 09:00:00  1800     0   0     0
6  2016-09-11 10:00:00 2016-09-11 09:30:00  1048     0   0     0
7  2016-09-11 11:00:00 2016-09-11 10:30:00     0     0   0     0
13 2016-09-11 14:30:00 2016-09-11 14:00:00     0   463   0   701
14 2016-09-11 15:00:00 2016-09-11 14:30:00     0   220   0     0
15 2016-09-11 15:30:00 2016-09-11 15:00:00     0   300   0  1277
16 2016-09-11 16:00:00 2016-09-11 15:30:00  1316     0   0    89
17 2016-09-11 16:30:00 2016-09-11 16:00:00   564   680   0     0
18 2016-09-11 17:00:00 2016-09-11 16:30:00     0  1654   0     0
19 2016-09-11 17:30:00 2016-09-11 17:00:00     0   389  20     0

Upvotes: 2

azalea

Reputation: 12640

Assuming both df1 and df2 are sorted in ascending order by the datetime_start column (it appears so), you only need one pass over each dataframe, giving an O(n + m) running time (n and m being the lengths of the two dataframes) rather than the current O(n*m) from comparing every pair of rows.

The following code illustrates the idea. The key point is using the iterators it1 and it2 to point at the current row of each dataframe. Since the dataframes are sorted, if row1 is already later in time than row2, we are sure that every subsequent row in df1 is also later than row2. This is easier to show in code than to explain in words:

def main(df1, df2):
    # uses the Range namedtuple and overlap function defined in the question
    for cat in df2.catg.unique().tolist():
        df1[cat] = 0
    it1 = df1.iterrows()
    it2 = df2.iterrows()
    idx1, row1 = next(it1)
    idx2, row2 = next(it2)
    while True:
        try:
            r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
            r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
            if r2.end < r1.start:
                # no overlap. r2 before r1. advance it2
                idx2, row2 = next(it2)
            elif r1.end < r2.start:
                # no overlap. r1 before r2. advance it1
                idx1, row1 = next(it1)
            else:
            # ranges overlap, so overlap(row1, row2) must be > 0
                df1.loc[idx1, row2.catg] += overlap(row1, row2)
                # determine whether to advance it1 or it2
                if r1.end < r2.end:
                    # advance it1
                    idx1, row1 = next(it1)
                else:
                    # advance it2
                    idx2, row2 = next(it2)
        except StopIteration:
            break

main(df1, df2)

Upvotes: 3
