Merge overlapping operation in csv without Pandas

Question

I have a csv file that contains data like that

Sample csv

Name	Start	End
John	12:00	13:00
John	12:10	13:00
John	12:20	13:20
Tom	12:00	13:10
John	13:50	14:00
Jerry	14:00	14:30
Alice	15:00	16:00
Jerry	11:00	15:00

I need to perform Merging operation on overlapping intervals

Before merge

John [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
Jerry [14:00,14:35],[11:00,15:00]
Tom [15:00,16:00]
Alice [12:00,13:10]

After merge

John [12:00,13:20],[13:50,14:00]
Jerry [11:00,15:00]
Alice [12:00 ,13:10]
Tom [15:00,16:00]

with pandas its quite easily done. But i need to use it via python core libraries

I am tryring to do with csv_reader

import csv

dict = {}
with open('person.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        start_time = row["Start"]
        end_time = row["End"]
        dict[name] = [start_time,end_time]

but i am not sure how to compare the end and starting index to check overlapping condition

edg · Accepted Answer

Sorting the data as a first step makes the computation easier. To avoid using index everywhere I chose to use a NamedTuple :

from collections import namedtuple

Interval = namedtuple("Interval", "name start end")

l = [
    Interval("John",    "12:00",    "13:00"),
    Interval("John",    "12:10",    "13:00"),
    Interval("John",    "12:20",    "13:20"),
    Interval("Tom",     "12:00",    "13:10"),
    Interval("John",    "13:50",    "14:00"),
    Interval("Jerry",   "14:00",    "14:30"),
    Interval("Alice",   "15:00",    "16:00"),
    Interval("Jerry",   "11:00",    "15:00"),
]

sorted_l = sorted(l)
result = []

for interval in sorted_l:
    # init
    if not result:
        result.append(interval)
        continue

    last_interval = result[-1]
    
    # Not the same person
    if last_interval.name != interval.name:
        result.append(interval)

    # No intersection between intervals
    elif interval.start > last_interval.end:
        result.append(interval)

    # Overlap between intervals
    elif last_interval.start <= interval.start <= last_interval.end:
        # Upper bound can be the end of either interval
        upper_bound = max(last_interval.end, interval.end)
        # We overwrite the last interval with the new upper bound
        result[-1] = Interval(last_interval.name, last_interval.start, upper_bound)
    else:
        raise ValueError("Case not handled :(")

print(result)


>[
    Interval(name='Alice', start='15:00', end='16:00'), 
    Interval(name='Jerry', start='11:00', end='15:00'), 
    Interval(name='John', start='12:00', end='13:20'), 
    Interval(name='John', start='13:50', end='14:00'), 
    Interval(name='Tom', start='12:00', end='13:10')
]

Merge overlapping operation in csv without Pandas

Answers (1)

Related Questions