user1999109

Reputation: 441

How to keep only the first run of consecutive duplicate values in a pandas DataFrame?

I have a DataFrame:

df = pd.DataFrame(['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d'])

I want to keep the first consecutive run of each value (here 'a', 'b' and 'd'). If a value appears again later, after its first run has ended, I want to drop those later rows.

My expected output is:

['a','a','a','b','b','b','c','d','d','e','f']

If I use

print(df.drop_duplicates())

it deletes all duplicate values, including the repeats inside each first run. How can I get my expected output? Thanks in advance.

Upvotes: 0

Views: 79

Answers (1)

unutbu

Reputation: 880339

Compare each value with its preceding value to find the start of each run:

df['start'] = df[0] != df[0].shift()

Group the rows by value and, within each group, use cumsum to take a cumulative sum of the start flags (taking advantage of the fact that pandas treats True as 1 and False as 0). The cumulative sum acts as a run number for each value: 1 for its first run, 2 for its second, and so on:

df['group'] = df.groupby(0)['start'].cumsum()
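As a minimal sketch of these two steps on a shorter, made-up series (the data and the names `s`, `start` and `group` are illustrative assumptions, not part of the question), the flags and run numbers come out like this:

```python
import pandas as pd

# Hypothetical four-element example: 'a' has two separate runs.
s = pd.Series(['a', 'a', 'b', 'a'])

# True wherever a value differs from the one before it (start of a run).
start = s != s.shift()

# Per-value cumulative count of run starts; cast to int so the
# result is an integer run number regardless of pandas version.
group = start.astype(int).groupby(s).cumsum()

print(start.tolist())  # [True, False, True, True]
print(group.tolist())  # [1, 1, 1, 2]
```

The last 'a' gets run number 2 because it starts a second run of 'a', so it is the row that would be dropped.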

Then select all rows whose group number is 1 (i.e., rows in the first run of each value):

result = df.loc[df['group'] == 1]

Putting it all together:

import pandas as pd

df = pd.DataFrame(['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d'])
df['start'] = df[0] != df[0].shift()
df['group'] = df.groupby(0)['start'].cumsum()
result = df.loc[df['group'] == 1]
print(df)
#     0  start  group
# 0   a   True    1.0
# 1   a  False    1.0
# 2   a  False    1.0
# 3   b   True    1.0
# 4   b  False    1.0
# 5   b  False    1.0
# 6   c   True    1.0
# 7   d   True    1.0
# 8   d  False    1.0
# 9   a   True    2.0
# 10  a  False    2.0
# 11  b   True    2.0
# 12  b  False    2.0
# 13  e   True    1.0
# 14  f   True    1.0
# 15  d   True    2.0
# 16  d  False    2.0
df = result[[0]]
print(df)

yields

    0
0   a
1   a
2   a
3   b
4   b
5   b
6   c
7   d
8   d
13  e
14  f
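If you would rather not add helper columns to the DataFrame, the same idea can be wrapped in a small function. This is only a sketch: the name `keep_first_runs` is hypothetical, it works on a Series instead of a DataFrame column, and it does not treat missing values specially.

```python
import pandas as pd

def keep_first_runs(s: pd.Series) -> pd.Series:
    """Keep only the first consecutive run of each distinct value."""
    # 1 at the start of every run, 0 elsewhere.
    start = (s != s.shift()).astype(int)
    # Number the runs separately for each distinct value.
    run_number = start.groupby(s).cumsum()
    # Rows in a value's first run have run number 1.
    return s[run_number == 1]

s = pd.Series(['a','a','a','b','b','b','c','d','d',
               'a','a','b','b','e','f','d','d'])
result = keep_first_runs(s)
print(result.tolist())
# ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'e', 'f']
```

The original index labels (0 to 8, 13, 14) are preserved; call `reset_index(drop=True)` on the result if you want a fresh 0-based index.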

Upvotes: 1
