Zach Cornelison

Reputation: 67

Delete duplicate rows with different index values in Pandas

I have a dataframe containing product data, with the product ID stored as the index and the other attributes as columns. Due to human error, duplicate entries for the same item sometimes occur, and I need to filter them out. A duplicate row is identical to the previous row in every column; only the ProductID (the index value) differs.

Is there a way to delete rows that contain exactly the same values, and are clearly duplicate entries, despite having different index values? A sample of what I'm referring to is below:

[image: sample of the dataframe]

Upvotes: 1

Views: 1810

Answers (2)

Karol Oleksy

Reputation: 335

Sure, you can use the drop_duplicates() function from pandas. Look at the example below.

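import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]})

# Compare only these columns; 'id' is left out of the duplicate check.
# A new DataFrame is returned and df itself is not modified.
df.drop_duplicates(subset=['brand', 'style', 'rating'])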

As you can see, I used drop_duplicates() with the subset parameter, which lets me choose the columns I want to compare. If you want to read more about this function, see the pandas documentation for DataFrame.drop_duplicates.
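To make the effect concrete, here is a small sketch that runs the same sample data and prints which id values survive. The keep='last' variant is an addition for illustration, not part of the original example; keep is a standard drop_duplicates() parameter that controls which row of a duplicate group is retained.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]})

cols = ['brand', 'style', 'rating']

# Default keep='first': the first row of each duplicate group is kept.
first_kept = df.drop_duplicates(subset=cols)
print(first_kept['id'].tolist())  # [0, 2, 3, 4]

# keep='last': the last row of each duplicate group is kept instead.
last_kept = df.drop_duplicates(subset=cols, keep='last')
print(last_kept['id'].tolist())  # [1, 2, 3, 4]
```

Rows 0 and 1 match on all three compared columns, so only one of them remains. Rows 3 and 4 differ in rating, so both stay.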

To improve the code above you can use

subset=df.columns.difference(['id'])

instead of

subset=['brand', 'style', 'rating']

This lets you pass the name of the column that should not be compared. In your case, you should use code like this:

df.drop_duplicates(subset=df.columns.difference(['ProductID']))
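One detail worth checking against the question: there, ProductID is the index rather than a column. drop_duplicates() compares column values only, so in that layout a plain call with no subset should already be enough. The sketch below shows both layouts; the product names, prices, and IDs are made up for illustration.

```python
import pandas as pd

# Hypothetical product data with ProductID as the index, as in the question.
products = pd.DataFrame(
    {
        'name': ['widget', 'widget', 'gadget'],
        'price': [9.99, 9.99, 24.50],
    },
    index=pd.Index([101, 102, 103], name='ProductID'),
)

# Index labels are not part of the comparison, so rows 101 and 102 match.
deduped = products.drop_duplicates()
print(deduped.index.tolist())  # [101, 103]

# If ProductID is an ordinary column instead, exclude it via subset.
as_column = products.reset_index()
deduped_col = as_column.drop_duplicates(
    subset=as_column.columns.difference(['ProductID']))
print(deduped_col['ProductID'].tolist())  # [101, 103]
```

In both cases the first occurrence keeps its original ProductID.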

Upvotes: 1

motrix

Reputation: 368

What you need is:

df.drop_duplicates(inplace=True, ignore_index=True)

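ignore_index=True relabels the remaining rows 0, 1, ..., n-1, which gives a new continuous index without gaps. Note that this replaces the existing index, so the original ProductID values are not kept.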
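A short comparison of the index with and without ignore_index may help when deciding which form to use. This is a sketch on made-up data (the names, prices, and IDs are hypothetical), and it assumes pandas 1.0 or later, where the ignore_index parameter is available.

```python
import pandas as pd

# Hypothetical product data with ProductID as the index.
products = pd.DataFrame(
    {'name': ['widget', 'widget', 'gadget'], 'price': [9.99, 9.99, 24.50]},
    index=pd.Index([101, 102, 103], name='ProductID'),
)

# Without ignore_index the surviving rows keep their ProductID labels.
kept = products.drop_duplicates()
print(kept.index.tolist())  # [101, 103]

# With ignore_index=True the index is rebuilt as 0..n-1.
renumbered = products.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1]
```

If the ProductID values need to survive, either omit ignore_index or move the index into a column with reset_index() first.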

Upvotes: 0
