Reputation: 145
Currently i have the following dataframe :
index timestamp | id_a | id_b | id_pair
--------------------------------------------------------
0 2020-01-01 00:00:00 | 1 | A | 1A
1 2020-01-01 00:01:30 | 1 | A | 1A
2 2020-01-01 00:02:30 | 1 | A | 1A
3 2020-01-01 00:07:30 | 1 | A | 1A
4 2020-01-01 00:00:00 | 2 | B | 2B
5 2000-01-01 00:00:00 | 3 | C | 3C
6 2000-01-01 00:00:00 | 4 | D | 4D
With dataframe i want to drop the rows who have the same id_pair and timestamp with the range of X minutes, lets say 5 minutes. And therefore the expected result are like this :
index timestamp | id_a | id_b | id_pair
--------------------------------------------------------
0 2020-01-01 00:00:00 | 1 | A | 1A
3 2020-01-01 00:07:30 | 1 | A | 1A
4 2020-01-01 00:00:00 | 2 | B | 2B
5 2000-01-01 00:00:00 | 3 | C | 3C
6 2000-01-01 00:00:00 | 4 | D | 4D
After searching to the stackoverflow question, i stumble on this question which has similar problem to mine
Drop Duplicates in a DataFrame if Timestamps are Close, but not Identical
I've recreated the code so that it fits my needs (pretty much the same), and the code looks like this
mask1 = df.groupby('id_pair').timestamp.apply(lambda x: x.diff().dt.seconds < 300)
mask2 = df.unique_contact.duplicated(keep=False) & (mask1 | mask1.shift(-1))
df[~mask2]
But when i run the code i'm encountering this error :
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Any help or advice would be apreciated
Thanks in advance
Python version : 3.6.12
Pandas version : 0.25.3
Upvotes: 0
Views: 353
Reputation: 862551
First convert column to datetime
s and then for expected output remove | mask1.shift(-1)
:
df['timestamp'] = pd.to_datetime(df['timestamp'])
mask1 = df.groupby('id_pair').timestamp.apply(lambda x: x.diff().dt.seconds < 300)
mask2 = df.id_pair.duplicated(keep=False) & mask1
df = df[~mask2]
print (df)
index timestamp id_a id_b id_pair
0 0 2020-01-01 1 A 1A
2 2 2020-01-01 2 B 2B
3 3 2000-01-01 3 C 3C
4 4 2000-01-01 4 D 4D
Upvotes: 1