Reputation: 233

drop_duplicates not working in pandas?

The purpose of my code is to import 2 Excel files, compare them, and print out the differences to a new Excel file.

However, after concatenating all the data, and using the drop_duplicates function, the code is accepted by the console. But, when printed to the new excel file, duplicates still remain within the day.

Am I missing something? Is something nullifying the drop_duplicates function?

My code is as follows:

import datetime
import xlrd
import pandas as pd
#identify excel file paths
filepath = r"excel filepath"
filepath2 = r"excel filepath2"
#read relevant columns from the excel files
df1 = pd.read_excel(filepath, sheetname="Sheet1", parse_cols= "B, D, G, O")
df2 = pd.read_excel(filepath2, sheetname="Sheet1", parse_cols= "B, D, F, J")
#merge the columns from both excel files into one column each respectively
df4 = df1["Exchange Code"] + df1["Product Type"] + df1["Product Description"] + df1["Quantity"].apply(str)
df5 = df2["Exchange"] + df2["Product Type"] + df2["Product Description"] + df2["Quantity"].apply(str)
#concatenate both columns from each excel file, to make one big column containing all the data
df = pd.concat([df4, df5])
#remove all whitespace from each row of the column of data
df=df.str.strip()
df=["".join(x.split()) for x in df] 
#convert the data to a dataframe from a series
df = pd.DataFrame({'Value': df}) 
#remove any duplicates
df.drop_duplicates(subset=None, keep="first", inplace=False)
#print to the console just as a visual aid
print(df)
#print the erroneous entries to an excel file
df.to_excel("Comparison19.xls")

Upvotes: 14

Answers (8)

Collier

Reputation: 11

If anyone else is having this issue then it may also help to check what data types you have included in your df. Apparently, primitive data types seem to be more likely to work correctly with drop_duplicates(). For example, I had a df with PosixPaths that were all identical, but drop_duplicates() would not remove any of them until I changed the PosixPaths to strings.

Upvotes: 1

Marco Hannemann

Reputation: 481

If you are using a DatetimeIndex in your DataFrame this will not work

df.drop_duplicates(subset=None, keep="first", inplace=True)

Instead one can use:

df = df[~df.index.duplicated()]

Like drop_duplicates(), pandas.Index.duplicated supports the keep argument which can be set to first (default), last or False.

Note: Make sure first the index is not of dtype object but datetime64, which you can check using df.index. You may need to convert the index first using

df.index = pd.to_datetime(df.index)

Upvotes: 13

Jan

Reputation: 11

I had the same problem, but a different reason.

After appending one dataframe to another I wanted to de-duplicate based on an id (integer). However, appending changed the type of that column to float and it did not work (see https://github.com/pydata/pandas/issues/6485). I fixed it by running the following before running drop_duplicates:

df = df.astype({'id': 'int64'})

Upvotes: 1

tpainisalsogreat

Reputation: 21

Not sure if this is a good place to put it. But I recently learned that .drop_duplicates() has to have a match in ALL subsets for dropping a row.

So for deleting multiple based on only the one value i used this code:

no_duplicates_df = df.drop_duplicates(subset=['email'], keep="first", inplace=False)                     # Delete duplicates in email
no_duplicates_df = no_duplicates_df.drop_duplicates(subset=['phonenumber'], keep="first", inplace=False) # Delete duplicates in phonenumber

Upvotes: 2

Wizhi

Reputation: 6549

Might help anyone in the future.

I had a column with dates, where I tried to remove duplicates without success. If it's not important to keep the column as a date for further operations, I converted the column from type object to string.

df = df.astype('str')

Then I performed @Keith answers

df = df.drop_duplicates(subset=None, keep="first", inplace=False)

Upvotes: 13

BAC83

Reputation: 891

I have just had this issue, and this was not the solution.

It may be in the docs - I admittedly havent looked - and crucially this is only when dealing with date-based unique rows: the 'date' column must be formatted as such.

If the date data is a pandas object dtype, the drop_duplicates will not work - do a pd.to_datetime first.

Upvotes: 15

Keith

Reputation: 645

You've got inplace=False so you're not modifying df. You want either

 df.drop_duplicates(subset=None, keep="first", inplace=True)

 df = df.drop_duplicates(subset=None, keep="first", inplace=False)

Upvotes: 32

Mohamed Ali JAMAOUI

Reputation: 14689

The use of inplace=False tells pandas to return a new dataframe with duplicates dropped, so you need to assign that back to df:

df = df.drop_duplicates(subset=None, keep="first", inplace=False)

or inplace=True to tell pandas to drop duplicates in the current dataframe

df.drop_duplicates(subset=None, keep="first", inplace=True)

Upvotes: 5

drop_duplicates not working in pandas?

Answers (8)

Related Questions