Reputation: 233
The purpose of my code is to import 2 Excel files, compare them, and print out the differences to a new Excel file.
However, after concatenating all the data, and using the drop_duplicates
function, the code is accepted by the console. But, when printed to the new excel file, duplicates still remain within the day.
Am I missing something? Is something nullifying the drop_duplicates
function?
My code is as follows:
import datetime
import xlrd
import pandas as pd
#identify excel file paths
filepath = r"excel filepath"
filepath2 = r"excel filepath2"
#read relevant columns from the excel files
df1 = pd.read_excel(filepath, sheetname="Sheet1", parse_cols= "B, D, G, O")
df2 = pd.read_excel(filepath2, sheetname="Sheet1", parse_cols= "B, D, F, J")
#merge the columns from both excel files into one column each respectively
df4 = df1["Exchange Code"] + df1["Product Type"] + df1["Product Description"] + df1["Quantity"].apply(str)
df5 = df2["Exchange"] + df2["Product Type"] + df2["Product Description"] + df2["Quantity"].apply(str)
#concatenate both columns from each excel file, to make one big column containing all the data
df = pd.concat([df4, df5])
#remove all whitespace from each row of the column of data
df=df.str.strip()
df=["".join(x.split()) for x in df]
#convert the data to a dataframe from a series
df = pd.DataFrame({'Value': df})
#remove any duplicates
df.drop_duplicates(subset=None, keep="first", inplace=False)
#print to the console just as a visual aid
print(df)
#print the erroneous entries to an excel file
df.to_excel("Comparison19.xls")
Upvotes: 14
Views: 79177
Reputation: 11
If anyone else is having this issue then it may also help to check what data types you have included in your df
. Apparently, primitive data types seem to be more likely to work correctly with drop_duplicates()
. For example, I had a df
with PosixPaths
that were all identical, but drop_duplicates()
would not remove any of them until I changed the PosixPaths
to strings.
Upvotes: 1
Reputation: 481
If you are using a DatetimeIndex in your DataFrame this will not work
df.drop_duplicates(subset=None, keep="first", inplace=True)
Instead one can use:
df = df[~df.index.duplicated()]
Like drop_duplicates()
, pandas.Index.duplicated
supports the keep
argument which can be set to first
(default), last
or False
.
Note: Make sure first the index is not of dtype object
but datetime64
, which you can check using df.index
. You may need to convert the index first using
df.index = pd.to_datetime(df.index)
Upvotes: 13
Reputation: 11
I had the same problem, but a different reason.
After appending one dataframe to another I wanted to de-duplicate based on an id (integer). However, appending changed the type of that column to float and it did not work (see https://github.com/pydata/pandas/issues/6485). I fixed it by running the following before running drop_duplicates:
df = df.astype({'id': 'int64'})
Upvotes: 1
Reputation: 21
Not sure if this is a good place to put it. But I recently learned that .drop_duplicates()
has to have a match in ALL subsets for dropping a row.
So for deleting multiple based on only the one value i used this code:
no_duplicates_df = df.drop_duplicates(subset=['email'], keep="first", inplace=False) # Delete duplicates in email
no_duplicates_df = no_duplicates_df.drop_duplicates(subset=['phonenumber'], keep="first", inplace=False) # Delete duplicates in phonenumber
Upvotes: 2
Reputation: 6549
Might help anyone in the future.
I had a column with dates, where I tried to remove duplicates without success. If it's not important to keep the column as a date for further operations, I converted the column from type object to string.
df = df.astype('str')
Then I performed @Keith answers
df = df.drop_duplicates(subset=None, keep="first", inplace=False)
Upvotes: 13
Reputation: 891
I have just had this issue, and this was not the solution.
It may be in the docs - I admittedly havent looked - and crucially this is only when dealing with date-based unique rows: the 'date' column must be formatted as such.
If the date
data is a pandas object dtype, the drop_duplicates
will not work - do a pd.to_datetime
first.
Upvotes: 15
Reputation: 645
You've got inplace=False
so you're not modifying df
. You want either
df.drop_duplicates(subset=None, keep="first", inplace=True)
or
df = df.drop_duplicates(subset=None, keep="first", inplace=False)
Upvotes: 32
Reputation: 14689
The use of inplace=False
tells pandas to return a new dataframe with duplicates dropped, so you need to assign that back to df
:
df = df.drop_duplicates(subset=None, keep="first", inplace=False)
or inplace=True
to tell pandas to drop duplicates in the current dataframe
df.drop_duplicates(subset=None, keep="first", inplace=True)
Upvotes: 5