Reputation: 79
Assuming the following dataframe:
date | id
2020-12-11 | A
2020-12-15 | A
2020-04-09 | B
2020-04-09 | C
2020-04-08 | C
2021-03-11 | D
2021-03-12 | D
2021-01-24 | E
2021-01-19 | E
Desired output:
date | id
2020-12-11 | A
2020-04-09 | B
2020-04-09 | C
2020-04-08 | C
2021-03-11 | D
2021-03-12 | D
2021-01-19 | E
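For reference, the sample frame above can be built like this (a minimal sketch, assuming the dates start out as plain strings):

import pandas as pd

df = pd.DataFrame({
    'date': ['2020-12-11', '2020-12-15', '2020-04-09', '2020-04-09', '2020-04-08',
             '2021-03-11', '2021-03-12', '2021-01-24', '2021-01-19'],
    'id': ['A', 'A', 'B', 'C', 'C', 'D', 'D', 'E', 'E'],
})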
Basically, if an id is duplicated, we want to check whether its dates are consecutive. If they are consecutive, keep both rows; otherwise, keep only the row with the smaller date. This is what I currently have, but I feel there has to be a more efficient way to do it.
import datetime
import pandas as pd

df['date'] = df['date'].apply(pd.to_datetime)
for i in range(1, len(df)):
    if df['id'].iloc[i-1] == df['id'].iloc[i]:
        if abs(df['date'].iloc[i-1] - df['date'].iloc[i]) > datetime.timedelta(days=1):  # check if the dates are more than 1 day apart
            print(max(df['date'].iloc[i], df['date'].iloc[i-1]))  # drop this entry, keep the other
Upvotes: 1
Views: 119
Reputation: 8768
Try this:
g = df.sort_values('date').groupby('id')
# keep all rows of ids whose sorted dates contain a 1-day step, plus each id's earliest date
df.loc[g['date'].transform(lambda x: x.diff().dt.days.eq(1).any()) | df['date'].eq(g['date'].transform('min'))]
Output:
date id
0 2020-12-11 A
2 2020-04-09 B
3 2020-04-09 C
4 2020-04-08 C
5 2021-03-11 D
6 2021-03-12 D
8 2021-01-19 E
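To see which rows each condition keeps, the same filter can be unpacked into two named masks (the names consecutive / is_min are illustrative, not part of the original):

g = df.sort_values('date').groupby('id')
# True for every row of an id whose sorted dates contain a gap of exactly one day
consecutive = g['date'].transform(lambda x: x.diff().dt.days.eq(1).any())
# True for the earliest date within each id
is_min = df['date'].eq(g['date'].transform('min'))
df.loc[consecutive | is_min]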
Old answer:
df2 = pd.concat([df.drop_duplicates(['id']),df.loc[df.assign(dc = df['date'].diff().abs().ne(pd.to_timedelta(1,'D')).cumsum(),idc = df['id'].ne(df['id'].shift()).cumsum()).duplicated(subset=['dc','idc'],keep='first')]]).sort_index()
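For readability, the same one-liner can be unpacked into steps (the dup name is mine; the behavior is unchanged):

# new block whenever the gap to the previous row is not exactly one day
dc = df['date'].diff().abs().ne(pd.to_timedelta(1, 'D')).cumsum()
# new block whenever the id changes
idc = df['id'].ne(df['id'].shift()).cumsum()
# rows that continue a run of same-id, 1-day-apart dates
dup = df.assign(dc=dc, idc=idc).duplicated(subset=['dc', 'idc'], keep='first')
df2 = pd.concat([df.drop_duplicates(['id']), df.loc[dup]]).sort_index()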
Upvotes: 0
Reputation: 150745
Let us try groupby:
# to_datetime accepts list-like, no need to apply
df['date'] = pd.to_datetime(df['date'])
s = df.groupby('id')['date']
mins, maxs = s.transform('min'), s.transform('max')
df[maxs.sub(mins).le(pd.to_timedelta('1D')) # ID duplicates and consecutive
| df['date'].eq(mins) # always keep the mins
]
Output:
date id
0 2020-12-11 A
2 2020-04-09 B
3 2020-04-09 C
4 2020-04-08 C
5 2021-03-11 D
6 2021-03-12 D
8 2021-01-19 E
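As a quick way to see what the filter compares (a sketch; the span name is mine), the quantity tested against one day is the per-id spread between the earliest and latest date, which matches the "consecutive" requirement as long as each id has at most two rows, as in the sample data:

# spread per id; <= 1 day means every row of that id is kept
span = df.groupby('id')['date'].agg(lambda x: x.max() - x.min())
print(span)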
Upvotes: 4