jonboy
jonboy

Reputation: 366

Drop consecutive duplicates across multiple columns - Pandas

There's a few questions on this but not using location based indexing of multiple columns: Pandas: Drop consecutive duplicates.

I have a df that may contain consecutive duplicate values across specific rows. I want to remove them for the last two columns only. Using the df below, I want to drop rows where values in year and sale are the same.

I'm getting an error using the query below.

import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
               'year': ['12', '14', '14', '13', '15', '15'],
              'sale': ['55', '40', '40', '84', '31', '32']})

cols = df.iloc[:,1:3]

# Option 1
df = df.loc[df[cols] != df['cols'].shift()].reset_index(drop = True)

ValueError: Must pass DataFrame with boolean values only

# Option 2
df = df[df.iloc[:,1:3].diff().ne(0).any(1)].reset_index(drop = True)

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Intended Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32

Notes:

1) I need to use index label for selecting columns as the labels will change. I need something fluid.

2) drop_duplicates isn't appropriate here as I only want to drop rows that are the same as the previous row. I don't want to drop the same value altogether.

Upvotes: 0

Views: 1220

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150785

I want to drop rows where values in year and sale are the same That means you can calculate the difference, check if they are equal zero on year and sale:

# if the data are numeric
# s = df[['year','sale']].diff().ne(0).any(1)

s = df[['year','sale']].ne(df[['year','sale']].shift()).any(1)
df[s]

Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32

Upvotes: 2

Related Questions