Omer T

Reputation: 45

Can't see the impact of drop_duplicates when used for pandas dataframe

I see no change after calling pandas.drop_duplicates() on the dataframe I'm working on in Python.

df = pd.read_excel('sample_data.xlsx', index_col=0)
df.drop_duplicates()

This is the data I'm working on

Upvotes: 2

Views: 1757

Answers (3)

Celius Stingher

Reputation: 18367

There are two issues that I can see with the code:

  1. You are not passing a subset. By default, per pandas' documentation, drop_duplicates() takes all columns into account and drops only rows that are duplicated across every column. If you wish to drop duplicates based on a certain column or group of columns, use the subset parameter.
  2. drop_duplicates() does not modify the DataFrame in place by default; it returns a new one. Either assign the result back, e.g. df = df.drop_duplicates(['col_1','col_2']), or pass inplace=True.

After taking these two points into account you should notice the difference.

Here is an example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'col_1':[1,2,3,3,1],'col_2':[1,1,3,3,1],'col_3':['a','b','c','d','a']})
print(df)

   col_1  col_2 col_3
0      1      1     a
1      2      1     b
2      3      3     c
3      3      3     d
4      1      1     a

If we use drop_duplicates() without any subset, it will drop rows that are duplicates across all columns. These are rows 0 and 4, as they match in all 3 columns. Since the default is keep='first', row 0 is kept and row 4 is dropped.

If we wish to use a subset, for instance drop_duplicates(['col_1','col_2']), then we can expect two groups of duplicates: rows 0 and 4 (their values for col_1 and col_2 are the same) and rows 2 and 3 (col_3 is no longer taken into account). As in the first case, row 4 is dropped and row 0 kept, and row 3 is dropped and row 2 kept. This is the output for the first case:

df.drop_duplicates(inplace=True)
print(df)
   col_1  col_2 col_3
0      1      1     a
1      2      1     b
2      3      3     c
3      3      3     d

And this one for the second case:

df.drop_duplicates(['col_1','col_2'],inplace=True)
print(df)
   col_1  col_2 col_3
0      1      1     a
1      2      1     b
2      3      3     c

Upvotes: 1

Denver

Reputation: 639

I believe you need to specify a column if you have no fully duplicate rows. Something like this for your use case:

df = pd.read_excel('sample_data.xlsx', index_col=0)
col = 'state'
df = df.drop_duplicates(subset=col)  # assign back: drop_duplicates is not in place by default
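To see why the assignment matters, here is a minimal sketch using a small hypothetical DataFrame with a 'state' column in place of sample_data.xlsx (the original data is not shown in the question):

```python
import pandas as pd

# Hypothetical stand-in for sample_data.xlsx, with a 'state' column
df = pd.DataFrame({'state': ['NY', 'CA', 'NY'], 'value': [1, 2, 1]})

result = df.drop_duplicates(subset='state')  # returns a new DataFrame
print(len(df))      # the original is untouched: still 3 rows
print(len(result))  # the returned copy is deduplicated: 2 rows

df = df.drop_duplicates(subset='state')  # assign back to keep the change
print(len(df))      # now 2 rows
```

Calling df.drop_duplicates(...) without binding the result, as in the question, computes the deduplicated frame and immediately discards it.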

Upvotes: 0

Alexander Santos

Reputation: 1691

It drops rows that are completely duplicated.

If a row matches another in every column but one, it is not a duplicate, so it won't be dropped.
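A quick sketch of this behavior, using made-up column names for illustration:

```python
import pandas as pd

# Two rows identical in col_1 and col_2 but differing in col_3
df = pd.DataFrame({'col_1': [1, 1], 'col_2': [2, 2], 'col_3': ['a', 'b']})

print(len(df.drop_duplicates()))                           # 2: nothing dropped, rows differ in col_3
print(len(df.drop_duplicates(subset=['col_1', 'col_2'])))  # 1: col_3 is ignored, so one row is dropped
```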

Upvotes: 0
