How can I get the total amount of time overlap between Dataframes?

Question

Say I have two pandas dataframes:

import pandas as pd

df1 = pd.DataFrame(
    {
        "Start": {
            0: "2019-07-19 07:00:00",
            1: "2019-07-19 08:00:00",
            2: "2019-07-19 10:00:00",
        },
        "Finish": {
            0: "2019-07-19 07:30:00",
            1: "2019-07-19 08:30:00",
            2: "2019-07-19 10:30:00",
        },
    }
)

df2 = pd.DataFrame(
    {
        "Start": {0: "2019-07-19 07:30:00", 1: "2019-07-19 08:15:00",},
        "Finish": {0: "2019-07-19 08:00:00", 1: "2019-07-19 09:00:00",},
    }
)
df1.Start = pd.to_datetime(df1.Start)
df2.Finish = pd.to_datetime(df2.Finish)

They look like this:

|    | Start               | Finish              |
|---:|:--------------------|:--------------------|
|  0 | 2019-07-19 07:00:00 | 2019-07-19 07:30:00 |
|  1 | 2019-07-19 08:00:00 | 2019-07-19 08:30:00 |
|  2 | 2019-07-19 10:00:00 | 2019-07-19 10:30:00 |

|    | Start               | Finish              |
|---:|:--------------------|:--------------------|
|  0 | 2019-07-19 07:30:00 | 2019-07-19 08:00:00 |
|  1 | 2019-07-19 08:15:00 | 2019-07-19 09:00:00 |

Here's what it looks if I plot them (colouring in the part between Start and Finish of each row):

Think of it as if df1 records times when TV1 was on, and df2 records times when TV2 was on. I would like to find the total amount of time that any TV was on. In the plot above, this is shown with the line df1 or df2.

ADDENDUM

Here's how I made the plot:

import plotly.figure_factory as ff

df3 = pd.DataFrame(
    {
        "Start": {0: "2019-07-19 07:00:00", 1: "2019-07-19 10:00:00",},
        "Finish": {0: "2019-07-19 09:00:00", 1: "2019-07-19 10:30:00",},
    }
)
df1['Resource'] = ['df1']*3
df2['Resource'] = ['df2']*2
df3['Resource'] = ['df1 or df2']*2
df1['Task'] = ['df1']*3
df2['Task'] = ['df2']*2
df3['Task'] = ['df1 or df2']*2

fig = ff.create_gantt(
    pd.concat([df1, df2, df3]).reset_index(drop=True),
    group_tasks=True,
    index_col="Resource",
)
fig.show()

and here's out I printed out the Dataframes:

from tabulate import tabulate

print(df1.pipe(tabulate, headers="keys", tablefmt="pipe"))
print(df2.pipe(tabulate, headers="keys", tablefmt="pipe"))

kabanus · Accepted Answer

Note, the input here is based on the original question.

I'm not sure if this can be done nicely since you are always comparing rows, but one way:

df1['start_time'] = pd.to_datetime(df1['start_time'])
df2['start_time'] = pd.to_datetime(df2['start_time'])
df1['end_time'] = pd.to_datetime(df1['end_time'])
df2['end_time'] = pd.to_datetime(df2['end_time'])

all_events = pd.concat((df1, df2)).sort_values('start_time')
result = all_events.iloc[0:1].copy()
for _, row in all_events.iterrows():
    if row['start_time'] <= result['end_time'].iloc[-1]:
        if row['end_time'] > result['end_time'].iloc[-1]:
            result['end_time'].iloc[-1] = row['end_time']
    else:
        result = result.append(row, ignore_index=True)

print(all_events)
print(result)

The starting part is just to let Pandas handle my time comparison by itself. The basics:

Create a table of all events
sort them by start time
Add the first row to your result
For each new line, is the start time is prior to the last added end time?
- Yes: if the new end time is greater than the previous one, update it.
- No: add this row as a new line to the result, since it does not intersect.

Results for your table (sorted all events and the result):

             end_time          start_time
0 2019-07-19 06:07:10 2019-07-19 06:04:57
1 2019-07-19 06:27:41 2019-07-19 06:26:33
2 2019-07-19 06:35:43 2019-07-19 06:33:18
0 2019-07-19 06:35:53 2019-07-19 06:34:56
1 2019-07-19 06:37:45 2019-07-19 06:36:44
2 2019-07-19 06:40:11 2019-07-19 06:38:33
3 2019-07-19 06:40:25 2019-07-19 06:38:37
4 2019-07-19 07:02:20 2019-07-19 06:59:48
3 2019-07-19 07:06:47 2019-07-19 07:01:20
4 2019-07-19 07:09:19 2019-07-19 07:07:17
             end_time          start_time
0 2019-07-19 06:07:10 2019-07-19 06:04:57
1 2019-07-19 06:27:41 2019-07-19 06:26:33
2 2019-07-19 06:35:53 2019-07-19 06:33:18
3 2019-07-19 06:37:45 2019-07-19 06:36:44
4 2019-07-19 06:40:25 2019-07-19 06:38:33
5 2019-07-19 07:06:47 2019-07-19 06:59:48
6 2019-07-19 07:09:19 2019-07-19 07:07:17

The actual deltas are just:

>>> print(result['end_time'] - result['start_time'])
0   00:02:13
1   00:01:08
2   00:02:35
3   00:01:01
4   00:01:52
5   00:06:59
6   00:02:02

How can I get the total amount of time overlap between Dataframes?

ADDENDUM

Answers (1)

Related Questions