NCall
NCall

Reputation: 133

Merge overlapping datetime intervals

I have a df with multiple datetime intervals (start time, end time) and values.

input:

id   start          end            value
1    08:00:00.000   12:00:00.000   5
2    09:00:00.000   10:00:00.000   6
2    10:00:00.000   14:00:00.000   4
1    12:00:00.000   15:00:00.000   3

expected output:

id   start          end            value 
1    08:00:00.000   09:00:00.000   5
2    09:00:00.000   10:00:00.000   6
1    10:00:00.000   12:00:00.000   5
2    12:00:00.000   14:00:00.000   4
1    14:00:00.000   15:00:00.000   3

There is overlap between some of them. The objective is to have a succession of interval without overlapping.

diagram enter image description here

When there is overlapping, I want to keep the interval with the highest value.

I coded a thing that loops on the df to find the overlapping intervals, create a new succession of intervals based on the condition and remove the old ones. I would like to find an alternative way, better optimized. Maybe with split of all intervals at the intersections and after, loop on the df and remove the intervals that overlap based on condition.

done = True

while done:
    done = False
    df_copy = df
    for i, row in df.iterrows():
        row_interval = pd.Interval(row.start, row.end)
        if done:
            break
        for j, row_copy in row_copy.iterrows():
            row_copy_interval = pd.Interval(row_copy.start, row_copy.end)
            if i is not j and row_interval.overlaps(row_copy_interval):
                earliest_start = np.minimum(row.start, row_copy.start)
                latest_start = np.maximum(row.start, row_copy.start)
                earliest_end = np.minimum(row.end, row_copy.end)
                latest_end = np.maximum(row.end, row_copy.end)

                if row.value > row_copy.value:
                    value = row.value
                else:
                    value = row_copy.value

                if row_interval == pd.Interval(earliest_start, earliest_end):
                    df = df.append('value': row.value, 'start': earliest_start,'end': latest_start}, ignore_index=True)
                    df = df.append('value': value, 'start': latest_start,'end': earliest_end}, ignore_index=True)
                    df = df.append('value': row_copy.value, 'start': earliest_end,'end': latest_end}, ignore_index=True)
                elif row_interval == pd.Interval(earliest_start, latest_end):
                    ...
                elif row_interval == pd.Interval(latest_start, latest_end):
                    ...
                elif row_interval == pd.Interval(latest_start, earliest_end):
                    ...

                df = df.drop([i, j]).drop_duplicates()
                done = True
                break

Upvotes: 4

Views: 1062

Answers (1)

Guybrush
Guybrush

Reputation: 2790

I'm the maintainer of portion, a Python library to deal with (unions of) intervals of arbitrary (comparable) objects (see https://github.com/AlexandreDecan/portion, also available on PyPI).

Among the features offered by portion, you'll find the IntervalDict class that, basically, acts as a classical dict in which keys are (unions of) intervals. This class can be helpful in your use-case, since it allows you to put all your date(time) intervals into a single structure and to apply some logic on top of it.

An IntervalDict object defines a .merge function that accepts another IntervalDict and takes as input a function to explain how the two IntervalDict instances have to be merged. Using that function, you can specify that the "max" of the values have to be kept for all overlapping intervals. Said in another way: create an IntervalDict instance for each row of your dataframe, then iteratively apply .merge on them using the max function as input, and you'll get, at the end, a list of (key, value) pairs where each key is a (non-overlapping) interval, and each value will be the max over that interval of date(time) values.

Upvotes: 2

Related Questions