Reputation: 133
I have a df with multiple datetime intervals (start time, end time) and values.
input:
id start end value
1 08:00:00.000 12:00:00.000 5
2 09:00:00.000 10:00:00.000 6
2 10:00:00.000 14:00:00.000 4
1 12:00:00.000 15:00:00.000 3
expected output:
id start end value
1 08:00:00.000 09:00:00.000 5
2 09:00:00.000 10:00:00.000 6
1 10:00:00.000 12:00:00.000 5
2 12:00:00.000 14:00:00.000 4
1 14:00:00.000 15:00:00.000 3
There is overlap between some of them. The objective is to have a succession of interval without overlapping.
When there is overlapping, I want to keep the interval with the highest value.
I coded a thing that loops on the df to find the overlapping intervals, create a new succession of intervals based on the condition and remove the old ones. I would like to find an alternative way, better optimized. Maybe with split of all intervals at the intersections and after, loop on the df and remove the intervals that overlap based on condition.
done = True
while done:
done = False
df_copy = df
for i, row in df.iterrows():
row_interval = pd.Interval(row.start, row.end)
if done:
break
for j, row_copy in row_copy.iterrows():
row_copy_interval = pd.Interval(row_copy.start, row_copy.end)
if i is not j and row_interval.overlaps(row_copy_interval):
earliest_start = np.minimum(row.start, row_copy.start)
latest_start = np.maximum(row.start, row_copy.start)
earliest_end = np.minimum(row.end, row_copy.end)
latest_end = np.maximum(row.end, row_copy.end)
if row.value > row_copy.value:
value = row.value
else:
value = row_copy.value
if row_interval == pd.Interval(earliest_start, earliest_end):
df = df.append('value': row.value, 'start': earliest_start,'end': latest_start}, ignore_index=True)
df = df.append('value': value, 'start': latest_start,'end': earliest_end}, ignore_index=True)
df = df.append('value': row_copy.value, 'start': earliest_end,'end': latest_end}, ignore_index=True)
elif row_interval == pd.Interval(earliest_start, latest_end):
...
elif row_interval == pd.Interval(latest_start, latest_end):
...
elif row_interval == pd.Interval(latest_start, earliest_end):
...
df = df.drop([i, j]).drop_duplicates()
done = True
break
Upvotes: 4
Views: 1062
Reputation: 2790
I'm the maintainer of portion
, a Python library to deal with (unions of) intervals of arbitrary (comparable) objects (see https://github.com/AlexandreDecan/portion, also available on PyPI).
Among the features offered by portion
, you'll find the IntervalDict
class that, basically, acts as a classical dict in which keys are (unions of) intervals. This class can be helpful in your use-case, since it allows you to put all your date(time) intervals into a single structure and to apply some logic on top of it.
An IntervalDict
object defines a .merge
function that accepts another IntervalDict
and takes as input a function to explain how the two IntervalDict
instances have to be merged. Using that function, you can specify that the "max" of the values have to be kept for all overlapping intervals. Said in another way: create an IntervalDict
instance for each row of your dataframe, then iteratively apply .merge
on them using the max
function as input, and you'll get, at the end, a list of (key, value)
pairs where each key
is a (non-overlapping) interval, and each value
will be the max over that interval of date(time) values.
Upvotes: 2