Bas R
Bas R

Reputation: 185

Deleting immediately subsequent rows which are the exact same as the previous for specific columns

I have a dataframe similar to the following.

import pandas as pd

data = pd.DataFrame({'ind': [111,222,333,444,555,666,777,888,999,000], 
                     'col1': [1,2,2,2,3,4,5,5,6,7], 
                     'col2': [9,2,2,2,9,9,5,5,9,9], 
                     'col3': [11,2,2,2,11,11,5,5,11,11], 
                     'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})

There is an index ind, a number of columns col 1, 2 and 3, and some other column with a value val. Within the three columns 1, 2 and 3 there are a number of rows which are the exact same as the previous row, for instance row with index 333 and 444 are the same as 222. My actual data set is larger but what I need to do is delete all rows which have the exact same value as the immediate previous row for a number of columns (col1, col2, col3 here).

This would give me a dataframe like this with indeces 333/444 and 888 removed:

data_clean = pd.DataFrame({'ind': [111,222,555,666,777,999,000], 
                     'col1': [1,2,3,4,5,6,7], 
                     'col2': [9,2,9,9,5,9,9], 
                     'col3': [11,2,11,11,5,11,11], 
                     'val': ['a', 'b', 'e', 'f', 'g', 'i', 'j']})

What is the best way to go about this for a larger dataframe?

Upvotes: 1

Views: 39

Answers (1)

mozway
mozway

Reputation: 260780

You can use shift and any for boolean indexing:

cols = ['col1', 'col2', 'col3']
out = data[data[cols].ne(data[cols].shift()).any(axis=1)]
# DeMorgan's equivalent:
# out = data[~data[cols].eq(data[cols].shift()).all(axis=1)]

Output:

   ind  col1  col2  col3 val
0  111     1     9    11   a
1  222     2     2     2   b
4  555     3     9    11   e
5  666     4     9    11   f
6  777     5     5     5   g
8  999     6     9    11   i
9    0     7     9    11   j

Intermediates

# shifted dataset
data[cols].shift()

   col1  col2  col3
0   NaN   NaN   NaN
1   1.0   9.0  11.0
2   2.0   2.0   2.0
3   2.0   2.0   2.0
4   2.0   2.0   2.0
5   3.0   9.0  11.0
6   4.0   9.0  11.0
7   5.0   5.0   5.0
8   5.0   5.0   5.0
9   6.0   9.0  11.0

# comparison
data[cols].ne(data[cols].shift())

    col1   col2   col3
0   True   True   True
1   True   True   True
2  False  False  False
3  False  False  False
4   True   True   True
5   True  False  False
6   True   True   True
7  False  False  False
8   True   True   True
9   True  False  False

# aggregation
data[cols].ne(data[cols].shift()).any(axis=1)

0     True
1     True
2    False
3    False
4     True
5     True
6     True
7    False
8     True
9     True
dtype: bool

Upvotes: 1

Related Questions