Reputation: 79
Assuming the following dataframe:
date | id
2020-12-11 | A
2020-12-15 | A
2020-04-09 | B
2020-04-09 | C
2020-04-08 | C
2021-03-11 | D
2021-03-12 | D
2021-01-24 | E
2021-01-19 | E
Desired output:
date | id
2020-12-11 | A
2020-04-09 | B
2020-04-09 | C
2020-04-08 | C
2021-03-11 | D
2021-03-12 | D
2021-01-19 | E
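For reference, the sample frame above can be built like this (a minimal sketch, assuming the dates start out as plain strings):

import pandas as pd

df = pd.DataFrame({
    'date': ['2020-12-11', '2020-12-15', '2020-04-09', '2020-04-09', '2020-04-08',
             '2021-03-11', '2021-03-12', '2021-01-24', '2021-01-19'],
    'id': ['A', 'A', 'B', 'C', 'C', 'D', 'D', 'E', 'E'],
})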
Basically, if an id is duplicated, we want to check whether its dates are consecutive. If they are consecutive, keep both rows; otherwise, keep only the row with the smaller date. This is what I currently have, but I feel there has to be a more efficient way to do it.
import datetime
import pandas as pd

df['date'] = df['date'].apply(pd.to_datetime)
for i in range(1, len(df)):
    if df['id'].iloc[i-1] == df['id'].iloc[i]:
        if abs(df['date'].iloc[i-1] - df['date'].iloc[i]) > datetime.timedelta(days=1):  # check if the dates are more than 1 day apart
            print(max(df['date'].iloc[i], df['date'].iloc[i-1]))  # drop this entry, keep the other
Upvotes: 1
Views: 119
Reputation: 8768
Try this:
g = df.sort_values('date').groupby('id')
# keep all rows of ids whose sorted dates contain a 1-day step, plus each id's earliest date
df.loc[g['date'].transform(lambda x: x.diff().dt.days.eq(1).any()) | df['date'].eq(g['date'].transform('min'))]
Output:
date id
0 2020-12-11 A
2 2020-04-09 B
3 2020-04-09 C
4 2020-04-08 C
5 2021-03-11 D
6 2021-03-12 D
8 2021-01-19 E
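To see which rows each condition keeps, the same filter can be unpacked into two named masks (the names consecutive / is_min are illustrative, not part of the original):

g = df.sort_values('date').groupby('id')
# True for every row of an id whose sorted dates contain a gap of exactly one day
consecutive = g['date'].transform(lambda x: x.diff().dt.days.eq(1).any())
# True for the earliest date within each id
is_min = df['date'].eq(g['date'].transform('min'))
df.loc[consecutive | is_min]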
Old answer:
df2 = pd.concat([df.drop_duplicates(['id']),df.loc[df.assign(dc = df['date'].diff().abs().ne(pd.to_timedelta(1,'D')).cumsum(),idc = df['id'].ne(df['id'].shift()).cumsum()).duplicated(subset=['dc','idc'],keep='first')]]).sort_index()
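For readability, the same one-liner can be unpacked into steps (the dup name is mine; the behavior is unchanged):

# new block whenever the gap to the previous row is not exactly one day
dc = df['date'].diff().abs().ne(pd.to_timedelta(1, 'D')).cumsum()
# new block whenever the id changes
idc = df['id'].ne(df['id'].shift()).cumsum()
# rows that continue a run of same-id, 1-day-apart dates
dup = df.assign(dc=dc, idc=idc).duplicated(subset=['dc', 'idc'], keep='first')
df2 = pd.concat([df.drop_duplicates(['id']), df.loc[dup]]).sort_index()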
Upvotes: 0
Reputation: 150745
Let us try groupby:
# to_datetime accepts list-like, no need to apply
df['date'] = pd.to_datetime(df['date'])
s = df.groupby('id')['date']
mins, maxs = s.transform('min'), s.transform('max')
df[maxs.sub(mins).le(pd.to_timedelta('1D')) # ID duplicates and consecutive
| df['date'].eq(mins) # always keep the mins
]
Output:
date id
0 2020-12-11 A
2 2020-04-09 B
3 2020-04-09 C
4 2020-04-08 C
5 2021-03-11 D
6 2021-03-12 D
8 2021-01-19 E
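As a quick way to see what the filter compares (a sketch; the span name is mine), the quantity tested against one day is the per-id spread between the earliest and latest date, which matches the "consecutive" requirement as long as each id has at most two rows, as in the sample data:

# spread per id; <= 1 day means every row of that id is kept
span = df.groupby('id')['date'].agg(lambda x: x.max() - x.min())
print(span)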
Upvotes: 4