Reputation: 71
I have a data set that looks like this:
id mz_id time duration
1 a 0:00 0:10
1 a 0:01 0:10
1 a 0:05 0:10
1 a 0:07 0:10
1 a 0:09 0:10
1 b 5:10 1:10
1 b 15:10 1:10
1 b 35:10 1:10
1 a 1:10:05 0:15
1 a 1:20:07 0:15
1 b 2:10:05 1:10
1 c 3:20:07 0:15
and so on..
I want to keep only the first appearance of each run of mz_id values, so that the result looks like this:
id mz_id time duration
1 a 0:00 0:10
1 b 5:10 1:10
1 a 1:10:05 0:15
1 b 2:10:05 1:10
1 c 3:20:07 0:15
The pair mz_id and duration is not unique, and I cannot figure out how to delete the duplicates in that case. The deduplication should be based only on mz_id: always keep the first appearance of an mz_id value until it changes.
Upvotes: 1
Views: 834
Reputation: 11
Another way if you want to keep the original index is:
df[df['mz_id'] != df['mz_id'].shift()]
Upvotes: 1
Reputation: 75090
IIUC, use a helper series and groupby + first:
df.groupby(df['mz_id'].ne(df['mz_id'].shift()).cumsum()).first().reset_index(drop=True)
id mz_id time duration
0 1 a 0:00 0:10
1 1 b 5:10 1:10
2 1 a 1:10:05 0:15
3 1 b 2:10:05 1:10
4 1 c 3:20:07 0:15
Details:
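shift() moves the column down by one row, so ne() checks whether each row differs from the previous row. Taking cumsum() of that boolean series assigns a group number to each run of consecutive equal values; grouping by that series and calling first() then returns the first row of each run.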
df['mz_id'].ne(df['mz_id'].shift())
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 False
10 True
11 True
Hence this returns True for the first row of every run of consecutive identical mz_id values, and False for the rows that repeat the previous value.
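To make the cumsum() step concrete, here is a small sketch on the mz_id column alone. The names starts and run_id are illustrative and not part of the original answer:

```python
import pandas as pd

# Only the mz_id column from the question's sample data.
mz_id = pd.Series(list("aaaaabbbaabc"), name="mz_id")

# True at the start of each run of consecutive equal values.
starts = mz_id.ne(mz_id.shift())

# The cumulative sum turns the True flags into a run number per row.
run_id = starts.cumsum()
print(run_id.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]

# Grouping by run_id and taking the first value yields one entry per run.
first_per_run = mz_id.groupby(run_id).first()
print(first_per_run.tolist())
# ['a', 'b', 'a', 'b', 'c']
```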
However, as @Datanovice points out, in your example the following also works, since you are only taking the first row of each run:
df[df['mz_id'].ne(df['mz_id'].shift())]
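One case where the two approaches can diverge, as far as I know, is missing data: GroupBy.first() returns the first non-null value of each column within a group, while the boolean mask keeps the first row exactly as it is. A small sketch with hypothetical data (not from the question) to illustrate:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row of the 'a' run has a missing duration.
df = pd.DataFrame({
    "mz_id": ["a", "a", "b"],
    "duration": [np.nan, "0:10", "1:10"],
})

# True at the start of each run of consecutive equal mz_id values.
starts = df["mz_id"].ne(df["mz_id"].shift())

# Boolean mask: keeps the first row of each run unchanged.
by_mask = df[starts].reset_index(drop=True)

# groupby + first: takes the first non-null value per column in each run.
by_group = df.groupby(starts.cumsum()).first().reset_index(drop=True)

print(by_mask["duration"].tolist())   # [nan, '1:10']
print(by_group["duration"].tolist())  # ['0:10', '1:10']
```

On the question's sample data, which has no missing values, both give the same rows.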
Upvotes: 3