Reputation: 335

Pandas Dataframe with start and end date, combine connected dates from multiple rows to one row

Lets assume I have the following df:

category    start   end
0   A   2015-01-01  2016-01-01
1   A   2016-01-01  2016-06-01
2   A   2016-06-01  2016-12-01
3   B   2016-01-01  2016-07-01
4   B   2018-01-01  2018-08-01

Now for each category, if the previous rows end date is equal to current rows start date, i'd like to "merge" them, and if not just keep them. Keeping the index is not important. The output would be

category    start   end
0   A   2015-01-01  2016-12-01
3   B   2016-01-01  2016-07-01
4   B   2018-01-01  2018-08-01

a = {'category':['A', 'A', 'A', 'B', 'B']}
b = {'start': [datetime(2015, 1,1).date(), datetime(2016, 1,1).date(), datetime(2016, 6, 1).date(), datetime(2016, 1,1).date(), datetime(2018, 1,1).date()]}
c = {'end':[datetime(2016, 1,1).date(), datetime(2016, 6,1).date(), datetime(2016, 12, 1).date(), datetime(2016, 7,1).date(), datetime(2018, 8,1).date()]}

a.update(b)
a.update(c)
df = pd.DataFrame.from_dict(a)

My original DataFrame has about 8 million rows. By doing this I could get that down to about 1 million. But it has to be able to perform on somewhat "large" data.

Upvotes: 1

Answers (2)

not_speshal

Reputation: 23146

You can try:

stack all the dates into one column
drop_duplicates: effectively getting rid of equal end and start dates
Keep every second row and restructure the result

df = df.sort_values(["category", "start", "end"])
stacked = df.set_index("category").stack().droplevel(-1).rename("start")
output = stacked.reset_index().drop_duplicates(keep=False).set_index("category")
output["end"] = output["start"].shift(-1)
output = output.iloc[range(0, len(output.index), 2)]

>>> stacked
              start        end
category                      
A        2015-01-01 2016-12-01
B        2016-01-01 2016-07-01
B        2018-01-01 2018-08-01

Upvotes: 2

Andreas

Reputation: 9207

You could do it like this:

Sample data:

import pandas as pd
d = {'category': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}, 'start': {0: '2015-01-01',  1: '2016-01-01',  2: '2016-06-01',  3: '2016-01-01',  4: '2018-01-01'}, 'end': {0: '2016-01-01',  1: '2016-06-01',  2: '2016-12-01',  3: '2016-07-01',  4: '2018-08-01'}}
df = pd.DataFrame(d)

Code:

# Create a series of boolean essentially marking the start of a new continuous group
df = df.sort_values(['category', 'start', 'end'])
mask_continue = ~df['start'].eq(df['end'].shift())
mask_group = ~df['category'].eq(df['category'].shift())
df['group'] = (mask_continue | mask_group).cumsum()

# group by that above generated group and get first and last value of it
g = df.groupby('group')
g_start = g.head(1).set_index('group')[['category', 'start']]
g_end = g.tail(1).set_index('group')['end']

# combine both
g_start['end'] = g_end

Output:

print(g_start)

      category       start         end
group
1            A  2015-01-01  2016-12-01
2            B  2016-01-01  2016-07-01
3            B  2018-01-01  2018-08-01

Upvotes: 1

Pandas Dataframe with start and end date, combine connected dates from multiple rows to one row

Answers (2)

Related Questions