Reputation: 17
I have a csv file that contains data like that
Sample csv
Name | Start | End |
---|---|---|
John | 12:00 | 13:00 |
John | 12:10 | 13:00 |
John | 12:20 | 13:20 |
Tom | 12:00 | 13:10 |
John | 13:50 | 14:00 |
Jerry | 14:00 | 14:30 |
Alice | 15:00 | 16:00 |
Jerry | 11:00 | 15:00 |
I need to perform Merging operation on overlapping intervals
Before merge
After merge
with pandas its quite easily done. But i need to use it via python core libraries
I am tryring to do with csv_reader
import csv
dict = {}
with open('person.csv', mode='r') as csv_file:
csv_reader = csv.DictReader(csv_file)
for row in csv_reader:
name = row["Name"]
start_time = row["Start"]
end_time = row["End"]
dict[name] = [start_time,end_time]
but i am not sure how to compare the end and starting index to check overlapping condition
Upvotes: 0
Views: 65
Reputation: 1180
Sorting the data as a first step makes the computation easier. To avoid using index everywhere I chose to use a NamedTuple :
from collections import namedtuple
Interval = namedtuple("Interval", "name start end")
l = [
Interval("John", "12:00", "13:00"),
Interval("John", "12:10", "13:00"),
Interval("John", "12:20", "13:20"),
Interval("Tom", "12:00", "13:10"),
Interval("John", "13:50", "14:00"),
Interval("Jerry", "14:00", "14:30"),
Interval("Alice", "15:00", "16:00"),
Interval("Jerry", "11:00", "15:00"),
]
sorted_l = sorted(l)
result = []
for interval in sorted_l:
# init
if not result:
result.append(interval)
continue
last_interval = result[-1]
# Not the same person
if last_interval.name != interval.name:
result.append(interval)
# No intersection between intervals
elif interval.start > last_interval.end:
result.append(interval)
# Overlap between intervals
elif last_interval.start <= interval.start <= last_interval.end:
# Upper bound can be the end of either interval
upper_bound = max(last_interval.end, interval.end)
# We overwrite the last interval with the new upper bound
result[-1] = Interval(last_interval.name, last_interval.start, upper_bound)
else:
raise ValueError("Case not handled :(")
print(result)
>[
Interval(name='Alice', start='15:00', end='16:00'),
Interval(name='Jerry', start='11:00', end='15:00'),
Interval(name='John', start='12:00', end='13:20'),
Interval(name='John', start='13:50', end='14:00'),
Interval(name='Tom', start='12:00', end='13:10')
]
Upvotes: 1