Reputation: 2784
I created a DataFrame that has a duplicate row, like the one below:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017", "June 23, 2017", "December 12, 2017", None, "April 20, 2017", "April 20, 2017"],
                   "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
                   "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
                   "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99]})
Which looks like this in Jupyter:
If I get the duplicates, it correctly shows the two rows that are duplicates.
df[df.duplicated()]
I then call drop_duplicates to drop the second duplicate and keep the first.
df.drop_duplicates()
However, it looks like it's removing both rows instead of keeping the first. Am I missing something in the drop_duplicates method? The docstring indicates that the keep parameter defaults to 'first', and this still happens even if I explicitly pass that parameter.
Upvotes: 1
Views: 1572
Reputation: 323286
You have three duplicated rows in your example; use keep=False to see them all:
df[df.duplicated(keep=False)]
Out[661]:
Item Price Items Sold Order Date Sales Person
2 9.99 1.0 April 20, 2017 Rick
6 9.99 1.0 April 20, 2017 Rick
7 9.99 1.0 April 20, 2017 Rick
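Notice that with the default keep='first', the first occurrence (index 2) is not marked as a duplicate, which is why df.duplicated() only flagged two rows. A quick check (assuming the df from the question):
df[df.duplicated()]      # only rows 6 and 7 are flagged
df.duplicated().sum()    # 2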
Then, calling drop_duplicates will only keep the first one, the row at index 2:
df.drop_duplicates()
Out[659]:
Item Price Items Sold Order Date Sales Person
0 4.99 4.0 January 1, 2017 John
1 1.99 -999.0 March 15, 2017 John
2 9.99 1.0 April 20, 2017 Rick
3 19.99 NaN June 23, 2017 Mary
4 0.99 7.0 December 12, 2017 Mary
5 2.99 3.0 None Rick
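For completeness, a rough sketch of how the keep parameter changes the result, again using the df from the question:
df.drop_duplicates(keep='first')   # same as the default shown above; keeps index 2
df.drop_duplicates(keep='last')    # keeps index 7 instead of index 2
df.drop_duplicates(keep=False)     # drops all three identical rows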
Upvotes: 1