Reputation: 793
I have a dataframe like below:
A B C
1 8 23
2 8 22
3 9 45
4 9 45
5 6 12
6 4 10
7 11 12
I want to drop duplicates where keep the first value in the consecutive occurence if the C is also the same. E.G here occurence '9' is column B is repetitive and their correponding occurences in column 'C' is also repetitive '45'. In this case i want to retain the first occurence.
Expected Output:
A B C
1 8 23
2 8 22
3 9 45
5 6 12
6 4 10
7 11 12
I tried some group by, but didnot know how to drop.
code:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test=df.groupby('consecutive',as_index=False).apply(lambda x: (x['B'].head(1),x.shape[0],
x['C'].iloc[-1] - x['C'].iloc[0]))
This group by returns me a series, but i want to drop.
Upvotes: 1
Views: 296
Reputation: 42916
Using diff
, ne
and any
over axis=1
:
Note: this method only works for numeric columns
m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])
Output
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Details
df[['B', 'C']].diff()
B C
0 NaN NaN
1 0.0 -1.0
2 1.0 23.0
3 0.0 0.0
4 -3.0 -33.0
5 -2.0 -2.0
6 7.0 2.0
Then we check if any
of the values in a row are not equal (ne
) to 0
:
df[['B', 'C']].diff().ne(0).any(axis=1)
0 True
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
Upvotes: 1
Reputation: 477210
A oneliner to filter out such records is:
df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
Here we thus check if the columns ['B', 'C']
is the same as the shifted rows, if it is not, we retain the values:
>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
This is quite scalable, since we can define a function that will easily operate on an arbitrary number of values:
def drop_consecutive_duplicates(df, *colnames):
dff = df[list(colnames)]
return df[(dff.shift() != dff).any(axis=1)]
So you can then filter with:
drop_consecutive_duplicates(df, 'B', 'C')
Upvotes: 1
Reputation: 4487
If I understand your question correctly, given the following dataframe:
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12],})
This one-line code solved your problem using the drop_duplicates method:
df.drop_duplicates(['B', 'C'])
It gives as expected results:
B C
0 8 22
1 8 23
2 9 45
4 6 12
5 4 10
6 11 12
Upvotes: 0
Reputation: 863226
Add DataFrame.drop_duplicates
by 2 columns:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
A B C consecutive
0 1 8 23 1
1 2 8 22 1
2 3 9 45 2
4 5 6 12 3
5 6 4 10 4
6 7 11 12 5
Or chain both conditions with |
for bitwise OR
:
df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Upvotes: 2
Reputation: 298
Code
df1 = df.drop_duplicates(subset=['B', 'C'])
Result
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Upvotes: 0
Reputation: 696
the easy way to check the difference between row of B and C then drop value if difference is 0 (duplicate values), the code is
df[ ~((df.B.diff()==0) & (df.C.diff()==0)) ]
Upvotes: 1
Reputation: 149085
You can compute a series of the rows to drop, and then drop them:
to_drop = (df['B'] == df['B'].shift())&(df['C']==df['C'].shift())
df = df[~to_drop]
It gives as expected:
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Upvotes: 0