Reputation: 45
I see no change after calling pandas.drop_duplicates() on the dataframe I'm working on in Python.
df = pd.read_excel('sample_data.xlsx', index_col=0)
df.drop_duplicates()
This is the data I'm working on
Upvotes: 2
Views: 1757
Reputation: 18367
There are two issues that I can see you are having with the code:
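First, drop_duplicates() compares all columns by default and only drops rows that are identical in every column. If you wish to drop duplicates based on a certain column or group of columns, pass them through the subset parameter.
Second, drop_duplicates() returns a new DataFrame and leaves the original unchanged unless you pass inplace=True. So either use inplace=True or reassign the result: df = df.drop_duplicates(['col_1','col_2'])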
And after taking into account these 2 items you should notice the difference.
Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col_1':[1,2,3,3,1],'col_2':[1,1,3,3,1],'col_3':['a','b','c','d','a']})
print(df)
   col_1  col_2 col_3
0      1      1     a
1      2      1     b
2      3      3     c
3      3      3     d
4      1      1     a
If we use drop_duplicates() without any subset, it drops rows that are duplicated across all columns. Here that is rows 0 and 4, which match in all 3 columns. Since the default is keep='first', row 0 is kept and row 4 is dropped.
If we wish to use a subset, for instance drop_duplicates(['col_1','col_2']), then we can expect two groups of duplicate rows: rows 0 and 4 (because their values for col_1 and col_2 are the same) and rows 2 and 3 (because col_3, the only column where they differ, is not taken into account). As in the first case, row 0 is kept and row 4 is dropped, and likewise row 2 is kept and row 3 is dropped.
This would be the output for the first case:
df.drop_duplicates(inplace=True)
print(df)
   col_1  col_2 col_3
0      1      1     a
1      2      1     b
2      3      3     c
3      3      3     d
And this one for the second case:
df.drop_duplicates(['col_1','col_2'],inplace=True)
print(df)
   col_1  col_2 col_3
0      1      1     a
1      2      1     b
2      3      3     c
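To make the in-place point concrete, here is a minimal sketch using the same example frame. It only illustrates that a bare call leaves df unchanged while the returned frame is deduplicated; the variable names rows_after_bare_call, deduped and rows_after_reassign are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': [1, 2, 3, 3, 1],
    'col_2': [1, 1, 3, 3, 1],
    'col_3': ['a', 'b', 'c', 'd', 'a'],
})

# A bare call builds a deduplicated copy and throws it away; df is untouched.
df.drop_duplicates()
rows_after_bare_call = len(df)

# Keeping the return value (or reassigning to df) is what preserves the result.
deduped = df.drop_duplicates()
rows_after_reassign = len(deduped)

print(rows_after_bare_call, rows_after_reassign)
```

This is most likely why the call in the question seemed to do nothing: the result was never stored.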
Upvotes: 1
Reputation: 639
I believe you need to specify the column if none of your rows are duplicated across every column. Something like this for your use case:
df = pd.read_excel('sample_data.xlsx', index_col=0)
col = 'state'
df = df.drop_duplicates(subset=col)
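As a rough sketch of how subset behaves on a single column: the data below is invented, since the spreadsheet's real contents are not shown, and the 'city' column is purely hypothetical. It also shows how the keep parameter changes which row survives.

```python
import pandas as pd

# Hypothetical data standing in for the spreadsheet; only 'state' comes from the answer.
df = pd.DataFrame({
    'state': ['NY', 'CA', 'NY', 'TX', 'CA'],
    'city': ['Albany', 'Fresno', 'Buffalo', 'Austin', 'Oakland'],
})

# Default keep='first': the first row seen for each state survives.
first_per_state = df.drop_duplicates(subset='state')

# keep='last': the last row seen for each state survives.
last_per_state = df.drop_duplicates(subset='state', keep='last')

# keep=False: every state that appears more than once is removed entirely.
unrepeated_states = df.drop_duplicates(subset='state', keep=False)

print(first_per_state)
print(last_per_state)
print(unrepeated_states)
```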
Upvotes: 0
Reputation: 1691
It only drops rows that are complete duplicates.
If a row matches another in every column but one, it is not considered a duplicate and will not be dropped.
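One way to check this on your own data is DataFrame.duplicated(), which returns the boolean mask of rows that drop_duplicates() would remove. The sketch below reuses the small example frame from the earlier answer rather than the real spreadsheet, so treat it as an illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': [1, 2, 3, 3, 1],
    'col_2': [1, 1, 3, 3, 1],
    'col_3': ['a', 'b', 'c', 'd', 'a'],
})

# True marks a row that repeats an earlier row in every column.
full_row_mask = df.duplicated()

# True marks a row that repeats an earlier row in col_1 and col_2 only.
subset_mask = df.duplicated(subset=['col_1', 'col_2'])

# Rows 2 and 3 differ only in col_3, so only the subset mask flags row 3.
print(full_row_mask.tolist())
print(subset_mask.tolist())
```

If the full-row mask is all False for your frame, drop_duplicates() without a subset has nothing to remove.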
Upvotes: 0