Sander van den Oord
Sander van den Oord

Reputation: 12818

Pandas: Reduce overlapping date ranges to one line (per group)

For every customer_id I have several start dates and end dates.

When a customer has several overlapping date ranges I would like to reduce those to one line that has the minimum start date of the overlapping date ranges and the maximum end date of the overlapping date ranges.

Here's my example data frame:

customer_id start_date  end_date
1   2019-01-01  2019-03-01
1   2020-01-02  2020-03-01
1   2020-01-03  2020-05-04
1   2020-01-05  2020-06-01
1   2020-01-07  2020-02-02
1   2020-09-03  2020-09-05
1   2020-09-04  2020-09-04
1   2020-10-01  NaT
2   2020-05-01  2020-05-03

This is what the end result should look like:

customer_id start_date  end_date
1   2019-01-01  2019-03-01
1   2020-01-02  2020-06-01
1   2020-09-03  2020-09-05
1   2020-10-01  NaT
2   2020-05-01  2020-05-03

I've tried the following already, but that didn't really work out:
Find date range overlap in python

Here's sample code that generated these examples:

import pandas as pd

df = pd.DataFrame(data=[
    [1, '2019-01-01', '2019-03-01'],
    [1, '2020-01-03', '2020-05-04'],
    [1, '2020-01-05', '2020-06-01'],
    [1, '2020-01-02', '2020-03-01'],
    [1, '2020-01-07', '2020-02-02'],
    [1, '2020-09-03', '2020-09-05'],
    [1, '2020-09-04', '2020-09-04'],
    [1, '2020-10-01', None], 
    [2, '2020-05-01', '2020-05-03']],
    columns=['customer_id', 'start_date', 'end_date'],

)

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

df.sort_values(by=['customer_id', 'start_date', 'end_date'])


expected_df = pd.DataFrame(data=[
    [1, '2019-01-01', '2019-03-01'],
    [1, '2020-01-02', '2020-06-01'],
    [1, '2020-09-03', '2020-09-05'],
    [1, '2020-10-01', None], 
    [2, '2020-05-01', '2020-05-03']],
    columns=['customer_id', 'start_date', 'end_date'],
)

expected_df['start_date'] = pd.to_datetime(expected_df['start_date'])
expected_df['end_date'] = pd.to_datetime(expected_df['end_date'])

expected_df.sort_values(by=['customer_id', 'start_date', 'end_date'])

Upvotes: 1

Views: 437

Answers (1)

Sander van den Oord
Sander van den Oord

Reputation: 12818

Henry Ecker pointed me in the right direction considering this problem as a graph:
Pandas combining rows based on dates

The code only needed a very small bit of rewriting to get the right answer:

from scipy.sparse.csgraph import connected_components

def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.start_date.values
    end = data.end_date.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'start_date': 'min',
                                            'end_date': 'max'})

df.groupby(['customer_id']).apply(reductionFunction).reset_index('customer_id')

Upvotes: 1

Related Questions