Reputation: 231
I have a pandas df with hundreds of columns and thousands of rows. Here are the 3 columns that interest us:
ID | startDate | endDate |
---|---|---|
123 | 2020-01-01 | 2020-01-25 |
123 | 2020-01-26 | 2020-02-08 |
123 | 2020-02-09 | 2020-03-12 |
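For rows that share the same ID, I want to merge them into one row when their dates follow each other (each startDate comes right after the previous endDate), while keeping all other columns intact.
In our example, the output would be a single row, because each period starts the day after the previous one ends: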
ID | startDate | endDate |
---|---|---|
123 | 2020-01-01 | 2020-03-12 |
Do you have an idea on how to do it with pandas?
Upvotes: 3
Views: 450
Reputation: 862431
If the dates are not sorted, or you are not sure that they are, use min and max for the aggregation:
df.groupby('ID', as_index=False).agg({'startDate': 'min', 'endDate': 'max'})
If there are many other columns and you only need to aggregate these two:
df['startDate'] = df.groupby('ID')['startDate'].transform('min')
df['endDate'] = df.groupby('ID')['endDate'].transform('max')
df = df.drop_duplicates('ID')
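Note that both snippets above collapse all rows of an ID into one, whether or not the dates actually follow each other. If gaps should split the result, one possible sketch is below. It assumes "follow each other" means the next startDate is exactly one day after the previous endDate. The fourth row, the other column, and the helper names prev_end, new_run and run are made up for illustration and are not from the question.

```python
import pandas as pd

# The three rows from the question, plus a hypothetical fourth row that
# starts after a gap and a hypothetical extra column named "other".
df = pd.DataFrame({
    "ID": [123, 123, 123, 123],
    "startDate": pd.to_datetime(["2020-01-01", "2020-01-26", "2020-02-09", "2020-05-01"]),
    "endDate": pd.to_datetime(["2020-01-25", "2020-02-08", "2020-03-12", "2020-05-31"]),
    "other": ["a", "b", "c", "d"],
})

# Sort so that each row can be compared with the previous row of the same ID.
df = df.sort_values(["ID", "startDate"]).reset_index(drop=True)

# End date of the previous row within the same ID (NaT for the first row).
prev_end = df.groupby("ID")["endDate"].shift()

# A new run starts when the row does not begin the day after the previous end.
# Comparing against NaT yields True, so the first row of each ID opens a run.
new_run = df["startDate"] != prev_end + pd.Timedelta(days=1)
df["run"] = new_run.cumsum()

# Keep the first value of every other column; take min/max of the dates.
agg = {col: "first" for col in df.columns if col not in ("ID", "run")}
agg["startDate"] = "min"
agg["endDate"] = "max"

merged = (
    df.groupby(["ID", "run"], as_index=False)
      .agg(agg)
      .drop(columns="run")
)
print(merged)
```

Here the first three rows merge into 2020-01-01 to 2020-03-12 and the fourth stays on its own. Using first for the remaining columns is only one choice; pick whatever aggregation fits your data.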
Upvotes: 4
Reputation: 71560
Try groupby with agg, using first and last:
>>> df.groupby('ID', as_index=False).agg({'startDate': 'first', 'endDate': 'last'})
    ID   startDate     endDate
0  123  2020-01-01  2020-03-12
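Keep in mind that first and last just take the first and last row of each group as they appear in the frame, so this relies on the rows already being in date order. A small sketch, assuming the same three rows from the question in shuffled order (the variable names are only illustrative), shows the difference and how sorting first avoids it:

```python
import pandas as pd

# The question's rows, deliberately shuffled.
df = pd.DataFrame({
    "ID": [123, 123, 123],
    "startDate": pd.to_datetime(["2020-02-09", "2020-01-01", "2020-01-26"]),
    "endDate": pd.to_datetime(["2020-03-12", "2020-01-25", "2020-02-08"]),
})

spec = {"startDate": "first", "endDate": "last"}

# Without sorting, first/last follow the row order, not the dates.
unsorted_result = df.groupby("ID", as_index=False).agg(spec)

# Sorting by ID and startDate first makes first/last match the date range.
sorted_result = (
    df.sort_values(["ID", "startDate"])
      .groupby("ID", as_index=False)
      .agg(spec)
)

print(unsorted_result)
print(sorted_result)
```

With min and max, as in the other answer, the sort is not needed.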
Upvotes: 3