Reputation: 366
There are a few questions on this topic, but none use location-based indexing of multiple columns: Pandas: Drop consecutive duplicates.

I have a df that may contain consecutive duplicate values across specific rows. I want to remove them based on the last two columns only. Using the df below, I want to drop rows where the values in year and sale are the same as in the previous row.

I'm getting an error with each of the queries below.
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
                   'year': ['12', '14', '14', '13', '15', '15'],
                   'sale': ['55', '40', '40', '84', '31', '32']})
cols = df.iloc[:,1:3]
# Option 1
df = df.loc[df[cols] != df['cols'].shift()].reset_index(drop = True)
ValueError: Must pass DataFrame with boolean values only
# Option 2
df = df[df.iloc[:,1:3].diff().ne(0).any(1)].reset_index(drop = True)
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Intended Output:

   month year sale
0      1   12   55
1      4   14   40
3     10   13   84
4     12   15   31
5     12   15   32
Notes:
1) I need to select the columns by index position rather than by label, as the labels will change. I need something fluid.

2) drop_duplicates isn't appropriate here, as I only want to drop rows that are the same as the previous row. I don't want to drop every occurrence of a duplicate value.
Upvotes: 0
Views: 1220
Reputation: 150785
I want to drop rows where values in year and sale are the same

That means you can compare each row with the previous row on year and sale, and keep it where at least one of the two differs (with numeric data you could equivalently take the difference and check that it is non-zero):
# if the data were numeric:
# s = df[['year', 'sale']].diff().ne(0).any(axis=1)
s = df[['year', 'sale']].ne(df[['year', 'sale']].shift()).any(axis=1)
df[s]
Output:

   month year sale
0      1   12   55
1      4   14   40
3     10   13   84
4     12   15   31
5     12   15   32
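The question's note 1 asks for position-based column selection. Assuming the columns to compare are always the last two by position, the same mask can be built with iloc instead of column labels (a sketch of that variant, not the exact code above):

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
                   'year': ['12', '14', '14', '13', '15', '15'],
                   'sale': ['55', '40', '40', '84', '31', '32']})

# Select the comparison columns by position, not by label.
cols = df.iloc[:, 1:3]

# Keep a row if at least one selected column differs from the
# same column in the previous row (the first row is always kept,
# since comparing against the shifted NaN row yields True).
mask = cols.ne(cols.shift()).any(axis=1)

out = df[mask].reset_index(drop=True)
print(out)
```

This keeps the logic fluid with respect to column labels: only the positional slice `1:3` encodes which columns are compared.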
Upvotes: 2