Reputation: 173
I have a large dataframe (approx. 10k rows); the first few rows of what I'll call df look like this:
logtime | zone | value
01/01/2017 06:05:00 | 0 | 14.5
01/01/2017 06:05:00 | 1 | 14.5
01/01/2017 06:05:00 | 2 | 17.0
01/01/2017 06:25:00 | 0 | 14.5
01/01/2017 06:25:00 | 1 | 14.5
01/01/2017 06:25:00 | 2 | 10.0
01/01/2017 06:50:00 | 0 | 10.0
01/01/2017 06:50:00 | 1 | 10.0
01/01/2017 06:50:00 | 2 | 10.0
01/01/2017 07:50:00 | 0 | 14.5
01/01/2017 07:50:00 | 1 | 14.5
01/01/2017 07:50:00 | 2 | 14.5
etc.
I am looking to drop consecutive duplicates per zone, so that I am left only with the rows where a zone's value changes. For example, if zone 1 sits at 14.5 over two consecutive logtimes, the repeated row is removed until the value changes to 10.0, leaving a dataframe like:
logtime | zone | value
01/01/2017 06:05:00 | 0 | 14.5
01/01/2017 06:05:00 | 1 | 14.5
01/01/2017 06:05:00 | 2 | 17.0
01/01/2017 06:25:00 | 2 | 10.0
01/01/2017 06:50:00 | 0 | 10.0
01/01/2017 06:50:00 | 1 | 10.0
01/01/2017 07:50:00 | 0 | 14.5
01/01/2017 07:50:00 | 1 | 14.5
01/01/2017 07:50:00 | 2 | 14.5
etc.
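To make this reproducible, here is a minimal sketch that builds the sample above (logtime is kept as plain strings for simplicity; in practice it would probably be parsed with pd.to_datetime):

import pandas as pd

# sample data from the table above; logtime left as strings
df = pd.DataFrame({
    'logtime': ['01/01/2017 06:05:00'] * 3 + ['01/01/2017 06:25:00'] * 3
             + ['01/01/2017 06:50:00'] * 3 + ['01/01/2017 07:50:00'] * 3,
    'zone': [0, 1, 2] * 4,
    'value': [14.5, 14.5, 17.0, 14.5, 14.5, 10.0,
              10.0, 10.0, 10.0, 14.5, 14.5, 14.5],
})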
My understanding is that drop_duplicates only keeps the first occurrence of each unique row (or column subset), regardless of position, so it doesn't work for my aim.
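For example, a quick check against the sample above:

print(df.drop_duplicates(subset=['zone', 'value']))

keeps only the first row for each zone/value pair, so the 07:50:00 rows for zones 0 and 1 disappear entirely, even though they represent genuine changes back to 14.5.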
I also tried .loc with shift:

removeduplicates = df.loc[(df.logtime != df.logtime.shift(1))
                          | (df.zone != df.zone.shift(1))
                          | (df.value != df.value.shift(1))]
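Inspecting the mask suggests why this can't work: zone changes on every consecutive row, so the (df.zone != df.zone.shift(1)) term alone is True everywhere. A quick check:

mask = ((df.logtime != df.logtime.shift(1))
        | (df.zone != df.zone.shift(1))
        | (df.value != df.value.shift(1)))
print(mask.all())  # True: every row passes, so nothing is dropped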
So the expression runs without error, but it just returns the whole frame unchanged instead of the desired output. Thanks!
Upvotes: 1
Views: 124
Reputation: 29635
You can create a boolean mask where the diff between successive values per group of zone is not equal to 0. The first value of each group diffs to NaN, and NaN != 0 is True, so those rows are kept:
print(df[df.groupby('zone')['value'].diff().ne(0)])
                logtime  zone  value
0   01/01/2017 06:05:00     0   14.5
1   01/01/2017 06:05:00     1   14.5
2   01/01/2017 06:05:00     2   17.0
5   01/01/2017 06:25:00     2   10.0
6   01/01/2017 06:50:00     0   10.0
7   01/01/2017 06:50:00     1   10.0
9   01/01/2017 07:50:00     0   14.5
10  01/01/2017 07:50:00     1   14.5
11  01/01/2017 07:50:00     2   14.5
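An equivalent sketch (just an alternative, not required) compares each value with the previous value in its zone via groupby + shift; this also works for non-numeric columns, where diff is unavailable:

print(df[df.groupby('zone')['value'].shift().ne(df['value'])])

The first row of each group compares against NaN and is therefore kept, so this selects the same rows as the diff-based mask.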
Upvotes: 2