Reputation: 441
I have a DataFrame:
df = pd.DataFrame(['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d'])
I want to keep the first consecutive run of each value (here the first runs of 'a', 'b' and 'd') and drop any later rows where a value that has already appeared shows up again.
My expected output is
['a','a','a','b','b','b','c','d','d','e','f']
If I use
print(df.drop_duplicates())
it removes every repeated value, including the repeats inside the first run, so only one row per value is left. How can I get my expected output? Thanks in advance.
Upvotes: 0
Views: 79
Reputation: 880339
Compare each value with its preceding value to find the start of each run:
df['start'] = df[0] != df[0].shift()
Within each group of equal values, use cumsum to take a cumulative sum of the start values (taking advantage of the fact that Pandas treats True as 1 and False as 0). The cumulative sum then acts as a group number: 1 for the first run of a value, 2 for its second run, and so on:
df['group'] = df.groupby(0)['start'].cumsum()
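To see why the cumulative sum is taken per group rather than over the whole column, here is a small sketch on a made-up five-element Series (the names s and starts are only for illustration). The flags are cast to int explicitly so the result does not depend on how a given pandas version sums booleans.

```python
import pandas as pd

# Toy data, not the question's DataFrame: runs are a,a | b | a | b
s = pd.Series(['a', 'a', 'b', 'a', 'b'])

# 1 where a value differs from the one before it (a new run begins), else 0
starts = (s != s.shift()).astype(int)

# Plain cumsum numbers the runs across the whole Series
print(starts.cumsum().tolist())             # [1, 1, 2, 3, 4]

# Grouped cumsum numbers the runs separately for each distinct value
print(starts.groupby(s).cumsum().tolist())  # [1, 1, 1, 2, 2]
```

Only the grouped version gives 1 for every row that belongs to the first run of its value, which is what the selection step relies on.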
Then select all rows which are in the first group (i.e., the first run of values):
result = df.loc[df['group'] == 1]
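import pandas as pd
df = pd.DataFrame(['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d'])
# True where a value differs from the previous row, i.e. where a run starts
df['start'] = df[0] != df[0].shift()
# Per distinct value, count how many runs have started so far
df['group'] = df.groupby(0)['start'].cumsum()
# Keep only the rows that belong to the first run of their value
result = df.loc[df['group'] == 1]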
print(df)
#     0  start  group
# 0   a   True    1.0
# 1   a  False    1.0
# 2   a  False    1.0
# 3   b   True    1.0
# 4   b  False    1.0
# 5   b  False    1.0
# 6   c   True    1.0
# 7   d   True    1.0
# 8   d  False    1.0
# 9   a   True    2.0
# 10  a  False    2.0
# 11  b   True    2.0
# 12  b  False    2.0
# 13  e   True    1.0
# 14  f   True    1.0
# 15  d   True    2.0
# 16  d  False    2.0
df = result[[0]]
print(df)
yields
    0
0   a
1   a
2   a
3   b
4   b
5   b
6   c
7   d
8   d
13  e
14  f
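If the helper columns are not needed afterwards, the same steps can be wrapped in a function that works on a Series. This is only a sketch of the approach above: the function name keep_first_runs is made up for illustration, and the explicit astype(int) is an added precaution rather than part of the original code.

```python
import pandas as pd

def keep_first_runs(values):
    """Keep only the first consecutive run of each distinct value."""
    s = pd.Series(values)
    # 1 where a value differs from the previous one (a run starts), else 0
    starts = (s != s.shift()).astype(int)
    # Number the runs of each value separately: 1 = first run, 2 = second, ...
    run_number = starts.groupby(s).cumsum()
    # Rows in a value's first run keep their original index labels
    return s[run_number == 1]

data = ['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d']
result = keep_first_runs(data)
print(result.tolist())
# ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'e', 'f']
```

The result keeps the original index (0 to 8, then 13 and 14); call reset_index(drop=True) on it if a fresh 0-based index is preferred.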
Upvotes: 1