Reputation: 12818
For every customer_id I have several start dates and end dates.
When a customer has several overlapping date ranges I would like to reduce those to one line that has the minimum start date of the overlapping date ranges and the maximum end date of the overlapping date ranges.
Here's my example data frame:
customer_id start_date end_date
1 2019-01-01 2019-03-01
1 2020-01-02 2020-03-01
1 2020-01-03 2020-05-04
1 2020-01-05 2020-06-01
1 2020-01-07 2020-02-02
1 2020-09-03 2020-09-05
1 2020-09-04 2020-09-04
1 2020-10-01 NaT
2 2020-05-01 2020-05-03
This is what the end result should look like:
customer_id start_date end_date
1 2019-01-01 2019-03-01
1 2020-01-02 2020-06-01
1 2020-09-03 2020-09-05
1 2020-10-01 NaT
2 2020-05-01 2020-05-03
I've tried the following already, but that didn't really work out:
Find date range overlap in python
Here's sample code that generated these examples:
import pandas as pd
df = pd.DataFrame(data=[
[1, '2019-01-01', '2019-03-01'],
[1, '2020-01-03', '2020-05-04'],
[1, '2020-01-05', '2020-06-01'],
[1, '2020-01-02', '2020-03-01'],
[1, '2020-01-07', '2020-02-02'],
[1, '2020-09-03', '2020-09-05'],
[1, '2020-09-04', '2020-09-04'],
[1, '2020-10-01', None],
[2, '2020-05-01', '2020-05-03']],
columns=['customer_id', 'start_date', 'end_date'],
)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df.sort_values(by=['customer_id', 'start_date', 'end_date'])
expected_df = pd.DataFrame(data=[
[1, '2019-01-01', '2019-03-01'],
[1, '2020-01-02', '2020-06-01'],
[1, '2020-09-03', '2020-09-05'],
[1, '2020-10-01', None],
[2, '2020-05-01', '2020-05-03']],
columns=['customer_id', 'start_date', 'end_date'],
)
expected_df['start_date'] = pd.to_datetime(expected_df['start_date'])
expected_df['end_date'] = pd.to_datetime(expected_df['end_date'])
expected_df.sort_values(by=['customer_id', 'start_date', 'end_date'])
Upvotes: 1
Views: 437
Reputation: 12818
Henry Ecker pointed me in the right direction considering this problem as a graph:
Pandas combining rows based on dates
The code only needed a very small bit of rewriting to get the right answer:
from scipy.sparse.csgraph import connected_components
def reductionFunction(data):
# create a 2D graph of connectivity between date ranges
start = data.start_date.values
end = data.end_date.values
graph = (start <= end[:, None]) & (end >= start[:, None])
# find connected components in this graph
n_components, indices = connected_components(graph)
# group the results by these connected components
return data.groupby(indices).aggregate({'start_date': 'min',
'end_date': 'max'})
df.groupby(['customer_id']).apply(reductionFunction).reset_index('customer_id')
Upvotes: 1