Reputation: 67
I have a dataframe containing product data, with the product ID stored as the index and the other attributes as columns. Due to human error, duplicate entries for the same item sometimes occur, and I need to filter them out. For a given duplicate row, everything matches the earlier row except the ProductID (index value).
Is there a way to delete rows that contain the exact same values and are clearly duplicate entries, despite having different index values? A sample of what I'm referring to is below:
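(Illustrative data only; the column names and values are made up. Rows 101 and 103 are the kind of duplicates I mean: identical in every column, different ProductID.)
import pandas as pd

# Hypothetical data -- 101 and 103 are identical apart from the ProductID index
df = pd.DataFrame(
    {'Name': ['Widget', 'Gadget', 'Widget'],
     'Price': [9.99, 4.50, 9.99],
     'Stock': [10, 25, 10]},
    index=pd.Index([101, 102, 103], name='ProductID'))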
Upvotes: 1
Views: 1810
Reputation: 335
Sure, you can use the drop_duplicates()
function from pandas. Look at the example below.
import pandas as pd

# Example data: the first two rows are duplicates apart from 'id'
df = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]})

# Compare only the listed columns, ignoring 'id'; keeps the first of each duplicate pair
df.drop_duplicates(subset=['brand', 'style', 'rating'])
As you can see, I used the drop_duplicates()
function with the parameter subset=[]
which lets me choose the columns I want to compare. If you want to read more about this function, see the pandas documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
To improve the code above you can use
subset=df.columns.difference(['id'])
instead of
subset=['brand', 'style', 'rating']
This lets you pass the name of the column that should be excluded from the comparison rather than listing every column to keep. In your case, the code would look like this:
df.drop_duplicates(subset=df.columns.difference(['ProductID']))
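For example, here is a minimal sketch on a hypothetical ProductID-indexed frame. Note that because ProductID is your index rather than a column, it isn't in df.columns to begin with, so drop_duplicates() already compares only the attribute columns; the difference() call is just a safeguard in case ProductID ever becomes a regular column.
import pandas as pd

# Hypothetical frame; ProductID is the index, as in the question
df = pd.DataFrame(
    {'Name': ['Widget', 'Gadget', 'Widget'],
     'Price': [9.99, 4.50, 9.99]},
    index=pd.Index([101, 102, 103], name='ProductID'))

# Keeps 101 and 102, drops 103 (identical to 101 in every column)
deduped = df.drop_duplicates(subset=df.columns.difference(['ProductID']))
print(deduped)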
Upvotes: 1
Reputation: 368
What you need is:
df.drop_duplicates(inplace=True, ignore_index=True)
The ignore_index=True
will relabel the result with a new continuous index (0, 1, 2, … with no gaps), replacing the old ProductIDs.
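A minimal sketch of the effect, on hypothetical data. Be aware that the original ProductIDs in the index are discarded, so keep a copy if you still need them:
import pandas as pd

# Hypothetical frame with duplicate rows 101 and 103
df = pd.DataFrame(
    {'Name': ['Widget', 'Gadget', 'Widget'],
     'Price': [9.99, 4.50, 9.99]},
    index=pd.Index([101, 102, 103], name='ProductID'))

df.drop_duplicates(inplace=True, ignore_index=True)
print(df.index)  # RangeIndex(start=0, stop=2, step=1) -- the old ProductIDs are gone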
Upvotes: 0