tichy
tichy

Reputation: 79

pandas - removing one of the duplicates if the date is consecutive

Assuming the following dataframe:

date | id
2020-12-11 | A
2020-12-15 | A
2020-04-09 | B
2020-04-09 | C
2020-04-08 | C
2021-03-11 | D
2021-03-12 | D
2021-01-24 | E
2021-01-19 | E

Desired output:

date | id
2020-12-11 | A
2020-04-09 | B
2020-04-09 | C
2020-04-08 | C
2021-03-11 | D
2021-03-12 | D
2021-01-19 | E

Basically, if id is duplicate, we want to check if the dates are consecutive. If consecutive, keep both, else, only keep the smaller date. Currently this is what I have, but I feel like there has to be a more efficient way to do so.

df['date'] = df['date'].apply(pd.to_datetime)
for i in range(1, len(df)):
    if date['id'].iloc[i-1] == date['id'].iloc[i]:
        if (abs(df['date'].iloc[i-1] - df['date'].iloc[i])) > datetime.timedelta(days=1): ## check if days are more than 1 day from each other
            print (max(df['date'].iloc[i], df['date'].iloc[i-1])) ## drop this entry, keep the other

Upvotes: 1

Views: 119

Answers (2)

rhug123
rhug123

Reputation: 8768

Try this:

g = df.sort_values('date').groupby('id')
df.loc[g['date'].transform(lambda x: x.diff().dt.days.eq(1).any()) | df['date'].eq(g['date'].transform('min'))]

Output:

        date id
0 2020-12-11  A
2 2020-04-09  B
3 2020-04-09  C
4 2020-04-08  C
5 2021-03-11  D
6 2021-03-12  D
8 2021-01-19  E

Old answer:

df2 = pd.concat([df.drop_duplicates(['id']),df.loc[df.assign(dc = df['date'].diff().abs().ne(pd.to_timedelta(1,'D')).cumsum(),idc = df['id'].ne(df['id'].shift()).cumsum()).duplicated(subset=['dc','idc'],keep='first')]]).sort_index()

Upvotes: 0

Quang Hoang
Quang Hoang

Reputation: 150745

Let us try groupby:

# to_datetime accepts list-like, no need to apply
df['date'] = pd.to_datetime(df['date'])

s = df.groupby('id')['date']
mins, maxs = s.transform('min'), s.transform('max')

df[maxs.sub(mins).le(pd.to_timedelta('1D'))  # ID duplicates and consecutive
   | df['date'].eq(mins)                     # always keep the mins
  ]

Output:

        date id
0 2020-12-11  A
2 2020-04-09  B
3 2020-04-09  C
4 2020-04-08  C
5 2021-03-11  D
6 2021-03-12  D
8 2021-01-19  E

Upvotes: 4

Related Questions