Reputation: 185
I'm trying to check whether two time segments in the same file_id (identified by their duration expressed as time_from and time_to) overlap in the following dataframes:
df1
id,file_id,time_from,time_to
1,10,00:00:19,00:00:25
2,12,00:02:39,00:02:49
3,12,00:04:18,00:04:30
4,12,00:05:30,00:05:55
5,15,00:01:35,00:01:38
6,18,00:07:35,00:07:48
df2
id,file_id,time_from,time_to
1,10,00:00:18,00:00:26
2,12,00:02:30,00:02:49
3,12,00:05:28,00:05:56
4,15,00:01:40,00:01:50
Basically I'd like to count the number of times segments overlap in the two dataframes for the same file_id, allowing for a given degree of distance between time_from and time_to (say +/- 4 seconds).
The output should be something like:
file_id, number_of_overlapping_segments
10, 1
12, 2
15, 1
18, 0
Here's the only case when two time segments are not overlapping and the algorithm should return 0 (assuming the distance is > 4 seconds). Every other situation should return 1:
[_____]
[________] [______]
My approach so far has been to use pandas to extend the time values by 4 seconds left and right (limits) of the given time segment and create temporary dataframes that hold the rows with overlapping time segments. Example:
import pandas as pd
left_overlap = []
right_overlap = []
for f in list(set(df1.file_id) & set(df2.file_id)):
for t_from, t_from_lim in list(zip(df2[df2.file_id==f]['time_from'],
df1[df1file_id==f]['time_from_limit'])):
if t_from > t_from_lim:
left_overlap.append(df_2[(df2.file_id==audio) & \
(df2.time_from==t_from)])
df_left_overlap = pd.concat(left_overlap).reset_index(drop=True)
for f in list(set(df1.file_id) & set(df2.file_id)):
for t_to, t_to_lim in list(zip(df_left_overlap[df_left_overlap.file_id==f]['time_to'],
df1[df1.file_id==f]['time_to_limit'])):
if t_to < t_to_lim:
right_overlap.append(
df_left_overlap[(df_left_overlap.file_id==f) & \
(df_left_overlap.time_to==t_to)])
overlap = pd.concat(right_overlap)
I think that this solution isn't efficient and I was hoping to find a more robust way of doing it.
Thanks in advance!
Upvotes: 3
Views: 700
Reputation: 25318
It looks like you're adding properties that are time_from-(4 seconds) and time_to+(4 seconds). (Is that what time_from_limit and time_to_limit are?) It also looks like you're trying to detect overlaps, when it's actually much easier to test whether they don't overlap, and negate it.
For example (pseudocode):
for segment1 in df1:
for segment2 in df2 records with matching fileid:
if not (segment1.['time_from'] >= segment2.['time_to'] + (4 seconds) or
segment2.['time_from'] >= segment1.['time_to'] + (4 seconds)):
# They overlap
counters[fileid]++
Upvotes: 1