Keep only first appearance of a value until it changes to another value

Question

I have a data set that looks like this:

 id      mz_id      time        duration
  1        a        0:00           0:10
  1        a        0:01           0:10
  1        a        0:05           0:10
  1        a        0:07           0:10
  1        a        0:09           0:10
  1        b        5:10           1:10
  1        b        15:10          1:10
  1        b        35:10          1:10
  1        a        1:10:05        0:15
  1        a        1:20:07        0:15
  1        b        2:10:05        1:10
  1        c        3:20:07        0:15

and so on.

I want to keep only the first appearance of each mz_id value until it changes, so that the result looks like this:

 id      mz_id      time        duration
  1        a        0:00           0:10
  1        b        5:10           1:10
  1        a        1:10:05        0:15
  1        b        2:10:05        1:10
  1        c        3:20:07        0:15

The pair mz_id and duration is not unique, so I cannot figure out how to drop the duplicates in this case. The removal should be based on one rule only: always keep the first appearance of mz_id until it changes.

anky · Accepted Answer

IIUC, use a helper series and groupby+first:

df.groupby(df['mz_id'].ne(df['mz_id'].shift()).cumsum()).first().reset_index(drop=True)

   id mz_id     time duration
0   1     a     0:00     0:10
1   1     b     5:10     1:10
2   1     a  1:10:05     0:15
3   1     b  2:10:05     1:10
4   1     c  3:20:07     0:15

Details: compare each row with the previous row (via shift()), then take the cumsum() of the resulting booleans so that every run of consecutive equal values gets its own group number, then group by that series and return the first row of each group.

df['mz_id'].ne(df['mz_id'].shift())

0      True
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8      True
9     False
10     True
11     True

Hence this returns True for the first row of each run of consecutive identical values, and False for the rows that repeat the previous value.
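To make the steps concrete, here is a minimal sketch that rebuilds the sample data from the question and prints the intermediate series. The names starts and run_id, and the rename to "run", are only illustrative choices and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1] * 12,
    "mz_id": list("aaaaabbbaabc"),
    "time": ["0:00", "0:01", "0:05", "0:07", "0:09", "5:10", "15:10",
             "35:10", "1:10:05", "1:20:07", "2:10:05", "3:20:07"],
    "duration": ["0:10"] * 5 + ["1:10"] * 3 + ["0:15"] * 2 + ["1:10", "0:15"],
})

# True wherever mz_id differs from the previous row (row 0 is always True,
# because shift() puts a missing value in front of it)
starts = df["mz_id"].ne(df["mz_id"].shift())

# Running total of the True values: one integer label per consecutive run.
# Renamed (illustrative) so the grouper does not share a name with a column.
run_id = starts.cumsum().rename("run")
print(run_id.tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]

# First row of every run, with a fresh 0..n-1 index
result = df.groupby(run_id).first().reset_index(drop=True)
print(result)
```

The cumsum() step is what separates the two "a" runs: they get labels 1 and 3, so they end up in different groups even though the value is the same.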

However, as @Datanovice points out, in your example the following would also work, since you are only taking the first row of each run:

df[df['mz_id'].ne(df['mz_id'].shift())]
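As a rough comparison of the two approaches, the sketch below runs both on a shortened, made-up version of the sample data (only the mz_id and time columns). The variable names are illustrative.

```python
import pandas as pd

# Shortened, illustrative version of the question's data
df = pd.DataFrame({
    "mz_id": ["a", "a", "b", "b", "a", "c"],
    "time": ["0:00", "0:01", "5:10", "15:10", "1:10:05", "3:20:07"],
})

# True at the first row of each run of identical mz_id values
starts = df["mz_id"].ne(df["mz_id"].shift())

# Boolean indexing keeps the original row labels
filtered = df[starts]
print(filtered.index.tolist())  # [0, 2, 4, 5]

# groupby + first on the run labels, for comparison
grouped = df.groupby(starts.cumsum().rename("run")).first().reset_index(drop=True)

# After resetting the index, both approaches select the same rows here
print(filtered.reset_index(drop=True)["time"].tolist() == grouped["time"].tolist())
```

One difference to keep in mind: groupby(...).first() takes the first non-null value per column within each group, whereas the boolean mask returns the first row exactly as it is. On data with missing values the two results can therefore differ, and the mask is the more literal reading of "keep the first row".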
