Reputation: 335
Lets assume I have the following df:
category start end
0 A 2015-01-01 2016-01-01
1 A 2016-01-01 2016-06-01
2 A 2016-06-01 2016-12-01
3 B 2016-01-01 2016-07-01
4 B 2018-01-01 2018-08-01
Now for each category, if the previous rows end date is equal to current rows start date, i'd like to "merge" them, and if not just keep them. Keeping the index is not important. The output would be
category start end
0 A 2015-01-01 2016-12-01
3 B 2016-01-01 2016-07-01
4 B 2018-01-01 2018-08-01
a = {'category':['A', 'A', 'A', 'B', 'B']}
b = {'start': [datetime(2015, 1,1).date(), datetime(2016, 1,1).date(), datetime(2016, 6, 1).date(), datetime(2016, 1,1).date(), datetime(2018, 1,1).date()]}
c = {'end':[datetime(2016, 1,1).date(), datetime(2016, 6,1).date(), datetime(2016, 12, 1).date(), datetime(2016, 7,1).date(), datetime(2018, 8,1).date()]}
a.update(b)
a.update(c)
df = pd.DataFrame.from_dict(a)
My original DataFrame has about 8 million rows. By doing this I could get that down to about 1 million. But it has to be able to perform on somewhat "large" data.
Upvotes: 1
Views: 1127
Reputation: 23146
You can try:
stack
all the dates into one columndrop_duplicates
: effectively getting rid of equal end and start datesdf = df.sort_values(["category", "start", "end"])
stacked = df.set_index("category").stack().droplevel(-1).rename("start")
output = stacked.reset_index().drop_duplicates(keep=False).set_index("category")
output["end"] = output["start"].shift(-1)
output = output.iloc[range(0, len(output.index), 2)]
>>> stacked
start end
category
A 2015-01-01 2016-12-01
B 2016-01-01 2016-07-01
B 2018-01-01 2018-08-01
Upvotes: 2
Reputation: 9207
You could do it like this:
Sample data:
import pandas as pd
d = {'category': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}, 'start': {0: '2015-01-01', 1: '2016-01-01', 2: '2016-06-01', 3: '2016-01-01', 4: '2018-01-01'}, 'end': {0: '2016-01-01', 1: '2016-06-01', 2: '2016-12-01', 3: '2016-07-01', 4: '2018-08-01'}}
df = pd.DataFrame(d)
Code:
# Create a series of boolean essentially marking the start of a new continuous group
df = df.sort_values(['category', 'start', 'end'])
mask_continue = ~df['start'].eq(df['end'].shift())
mask_group = ~df['category'].eq(df['category'].shift())
df['group'] = (mask_continue | mask_group).cumsum()
# group by that above generated group and get first and last value of it
g = df.groupby('group')
g_start = g.head(1).set_index('group')[['category', 'start']]
g_end = g.tail(1).set_index('group')['end']
# combine both
g_start['end'] = g_end
Output:
print(g_start)
category start end
group
1 A 2015-01-01 2016-12-01
2 B 2016-01-01 2016-07-01
3 B 2018-01-01 2018-08-01
Upvotes: 1