Anindhito Irmandharu
Anindhito Irmandharu

Reputation: 145

Drop Duplicates in a DataFrame where a column are identical and have near timestamps

Currently i have the following dataframe :

    index         timestamp      | id_a | id_b | id_pair
   --------------------------------------------------------
     0       2020-01-01 00:00:00 | 1    | A    |   1A
     1       2020-01-01 00:01:30 | 1    | A    |   1A
     2       2020-01-01 00:02:30 | 1    | A    |   1A
     3       2020-01-01 00:07:30 | 1    | A    |   1A
     4       2020-01-01 00:00:00 | 2    | B    |   2B
     5       2000-01-01 00:00:00 | 3    | C    |   3C
     6       2000-01-01 00:00:00 | 4    | D    |   4D

With dataframe i want to drop the rows who have the same id_pair and timestamp with the range of X minutes, lets say 5 minutes. And therefore the expected result are like this :

    index         timestamp      | id_a | id_b | id_pair
   --------------------------------------------------------
     0       2020-01-01 00:00:00 | 1    | A    |   1A
     3       2020-01-01 00:07:30 | 1    | A    |   1A
     4       2020-01-01 00:00:00 | 2    | B    |   2B
     5       2000-01-01 00:00:00 | 3    | C    |   3C
     6       2000-01-01 00:00:00 | 4    | D    |   4D

After searching to the stackoverflow question, i stumble on this question which has similar problem to mine
Drop Duplicates in a DataFrame if Timestamps are Close, but not Identical



I've recreated the code so that it fits my needs (pretty much the same), and the code looks like this

mask1 = df.groupby('id_pair').timestamp.apply(lambda x: x.diff().dt.seconds < 300)
mask2 = df.unique_contact.duplicated(keep=False) & (mask1 | mask1.shift(-1))
df[~mask2]

But when i run the code i'm encountering this error :

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Any help or advice would be apreciated
Thanks in advance



Python version : 3.6.12
Pandas version : 0.25.3

Upvotes: 0

Views: 353

Answers (1)

jezrael
jezrael

Reputation: 862551

First convert column to datetimes and then for expected output remove | mask1.shift(-1):

df['timestamp'] = pd.to_datetime(df['timestamp'])
mask1 = df.groupby('id_pair').timestamp.apply(lambda x: x.diff().dt.seconds < 300)
mask2 = df.id_pair.duplicated(keep=False) & mask1
df = df[~mask2]
print (df)
   index  timestamp  id_a id_b id_pair
0      0 2020-01-01     1    A      1A
2      2 2020-01-01     2    B      2B
3      3 2000-01-01     3    C      3C
4      4 2000-01-01     4    D      4D

Upvotes: 1

Related Questions