Bk thomas
Bk thomas

Reputation: 17

Merge overlapping operation in csv without Pandas

I have a csv file that contains data like that

Sample csv

Name Start End
John 12:00 13:00
John 12:10 13:00
John 12:20 13:20
Tom 12:00 13:10
John 13:50 14:00
Jerry 14:00 14:30
Alice 15:00 16:00
Jerry 11:00 15:00

I need to perform Merging operation on overlapping intervals

Before merge

After merge

with pandas its quite easily done. But i need to use it via python core libraries

I am tryring to do with csv_reader

import csv

dict = {}
with open('person.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        start_time = row["Start"]
        end_time = row["End"]
        dict[name] = [start_time,end_time]
        

but i am not sure how to compare the end and starting index to check overlapping condition

Upvotes: 0

Views: 65

Answers (1)

edg
edg

Reputation: 1180

Sorting the data as a first step makes the computation easier. To avoid using index everywhere I chose to use a NamedTuple :

from collections import namedtuple

Interval = namedtuple("Interval", "name start end")

l = [
    Interval("John",    "12:00",    "13:00"),
    Interval("John",    "12:10",    "13:00"),
    Interval("John",    "12:20",    "13:20"),
    Interval("Tom",     "12:00",    "13:10"),
    Interval("John",    "13:50",    "14:00"),
    Interval("Jerry",   "14:00",    "14:30"),
    Interval("Alice",   "15:00",    "16:00"),
    Interval("Jerry",   "11:00",    "15:00"),
]

sorted_l = sorted(l)
result = []

for interval in sorted_l:
    # init
    if not result:
        result.append(interval)
        continue

    last_interval = result[-1]
    
    # Not the same person
    if last_interval.name != interval.name:
        result.append(interval)

    # No intersection between intervals
    elif interval.start > last_interval.end:
        result.append(interval)

    # Overlap between intervals
    elif last_interval.start <= interval.start <= last_interval.end:
        # Upper bound can be the end of either interval
        upper_bound = max(last_interval.end, interval.end)
        # We overwrite the last interval with the new upper bound
        result[-1] = Interval(last_interval.name, last_interval.start, upper_bound)
    else:
        raise ValueError("Case not handled :(")

print(result)


>[
    Interval(name='Alice', start='15:00', end='16:00'), 
    Interval(name='Jerry', start='11:00', end='15:00'), 
    Interval(name='John', start='12:00', end='13:20'), 
    Interval(name='John', start='13:50', end='14:00'), 
    Interval(name='Tom', start='12:00', end='13:10')
]

Upvotes: 1

Related Questions