Reputation: 185
I have a dataframe similar to the following.
import pandas as pd
data = pd.DataFrame({'ind': [111,222,333,444,555,666,777,888,999,000],
'col1': [1,2,2,2,3,4,5,5,6,7],
'col2': [9,2,2,2,9,9,5,5,9,9],
'col3': [11,2,2,2,11,11,5,5,11,11],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
There is an index ind
, a number of columns col
1, 2 and 3, and some other column with a value val
. Within the three columns 1, 2 and 3 there are a number of rows which are the exact same as the previous row, for instance row with index 333 and 444 are the same as 222. My actual data set is larger but what I need to do is delete all rows which have the exact same value as the immediate previous row for a number of columns (col1
, col2
, col3
here).
This would give me a dataframe like this with indeces 333/444 and 888 removed:
data_clean = pd.DataFrame({'ind': [111,222,555,666,777,999,000],
'col1': [1,2,3,4,5,6,7],
'col2': [9,2,9,9,5,9,9],
'col3': [11,2,11,11,5,11,11],
'val': ['a', 'b', 'e', 'f', 'g', 'i', 'j']})
What is the best way to go about this for a larger dataframe?
Upvotes: 1
Views: 39
Reputation: 260780
You can use shift
and any
for boolean indexing:
cols = ['col1', 'col2', 'col3']
out = data[data[cols].ne(data[cols].shift()).any(axis=1)]
# DeMorgan's equivalent:
# out = data[~data[cols].eq(data[cols].shift()).all(axis=1)]
Output:
ind col1 col2 col3 val
0 111 1 9 11 a
1 222 2 2 2 b
4 555 3 9 11 e
5 666 4 9 11 f
6 777 5 5 5 g
8 999 6 9 11 i
9 0 7 9 11 j
Intermediates
# shifted dataset
data[cols].shift()
col1 col2 col3
0 NaN NaN NaN
1 1.0 9.0 11.0
2 2.0 2.0 2.0
3 2.0 2.0 2.0
4 2.0 2.0 2.0
5 3.0 9.0 11.0
6 4.0 9.0 11.0
7 5.0 5.0 5.0
8 5.0 5.0 5.0
9 6.0 9.0 11.0
# comparison
data[cols].ne(data[cols].shift())
col1 col2 col3
0 True True True
1 True True True
2 False False False
3 False False False
4 True True True
5 True False False
6 True True True
7 False False False
8 True True True
9 True False False
# aggregation
data[cols].ne(data[cols].shift()).any(axis=1)
0 True
1 True
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 True
dtype: bool
Upvotes: 1