Jon

Reputation: 2784

Pandas `drop_duplicates` doesn't keep first row

I created a DataFrame that contains duplicate rows, as shown below:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017", "June 23, 2017", "December 12, 2017", None, "April 20, 2017", "April 20, 2017"],
                   "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
                   "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
                   "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99]})

Which looks like this in Jupyter: [image: DataFrame]

If I select the duplicates, it correctly shows the two rows that are duplicates.

df[df.duplicated()]

[image: Duplicates]

I then call `drop_duplicates` to drop the second duplicate and keep the first.

df.drop_duplicates()

[image: Dropped]

However, it looks like it's removing both rows instead of keeping the first. Am I missing something in the `drop_duplicates` method? The docstring indicates that the `keep` parameter defaults to `'first'`, and this still happens even if I pass that parameter explicitly.

Upvotes: 1

Views: 1572

Answers (1)

BENY

Reputation: 323286

You have three duplicated rows in your example, not two. `df.duplicated()` defaults to `keep='first'`, so it does not flag the first occurrence. Use `keep=False` to see them all:

df[df.duplicated(keep=False)]
Out[661]: 
   Item Price  Items Sold      Order Date Sales Person
2        9.99         1.0  April 20, 2017         Rick
6        9.99         1.0  April 20, 2017         Rick
7        9.99         1.0  April 20, 2017         Rick
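
To make the effect of each `keep` option concrete, here is a minimal sketch that rebuilds the frame from the question and lists which index labels `duplicated` flags in each case. The variable names (`flagged_first`, `flagged_last`, `flagged_all`) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017",
                   "June 23, 2017", "December 12, 2017", None,
                   "April 20, 2017", "April 20, 2017"],
    "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
    "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
    "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99],
})

# Index labels flagged as duplicates under each `keep` option.
flagged_first = df.index[df.duplicated(keep="first")].tolist()  # first occurrence not flagged
flagged_last = df.index[df.duplicated(keep="last")].tolist()    # last occurrence not flagged
flagged_all = df.index[df.duplicated(keep=False)].tolist()      # every occurrence flagged

print(flagged_first)  # [6, 7]
print(flagged_last)   # [2, 6]
print(flagged_all)    # [2, 6, 7]
```

With the default, row 2 is the "first" occurrence and is never flagged, which is why only two rows show up in `df[df.duplicated()]`.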

Then `drop_duplicates` keeps only the first of them, the third row (index 2), and drops the other two:

df.drop_duplicates()
Out[659]: 
   Item Price  Items Sold         Order Date Sales Person
0        4.99         4.0    January 1, 2017         John
1        1.99      -999.0     March 15, 2017         John
2        9.99         1.0     April 20, 2017         Rick
3       19.99         NaN      June 23, 2017         Mary
4        0.99         7.0  December 12, 2017         Mary
5        2.99         3.0               None         Rick
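
As a quick check of the result above, the following sketch (again using the question's data, with illustrative variable names) compares `drop_duplicates()` against filtering with the inverted `duplicated()` mask, and shows what dropping every occurrence with `keep=False` would look like instead.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017",
                   "June 23, 2017", "December 12, 2017", None,
                   "April 20, 2017", "April 20, 2017"],
    "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
    "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
    "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99],
})

kept_first = df.drop_duplicates()             # default keep="first"
via_mask = df[~df.duplicated()]               # same rows, selected with a boolean mask
dropped_all = df.drop_duplicates(keep=False)  # removes every copy of a duplicated row

print(kept_first.index.tolist())    # [0, 1, 2, 3, 4, 5]
print(kept_first.equals(via_mask))  # True
print(dropped_all.index.tolist())   # [0, 1, 3, 4, 5]
```

Row 2 survives the default call, so the first occurrence is kept as documented; it would only disappear with `keep=False`.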

Upvotes: 1
