Reputation: 65
I've been looking around and trying examples, but I can't get it to work the way I want.
I want to dedupe by 'Order ID' and extract the duplicates to a separate CSV. The main thing is that I need to be able to change the column I dedupe by; in this case it's 'Order ID'.
Example Data set:
ID  Fruit   Order ID  Quantity  Price
1   apple   1111      11        £2.00
2   banana  2222      22        £3.00
3   orange  3333      33        £5.00
4   mango   4444      44        £7.00
5   Kiwi    3333      55        £5.00
Output:
ID  Fruit  Order ID  Quantity  Price
5   Kiwi   3333      55        £5.00
I've tried this:
import pandas as pd
df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate example.csv')
new_df = df[['ID','Fruit','Order ID','Quantity','Price']].drop_duplicates()
new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)
The issue I have is that it doesn't remove any duplicates.
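A minimal sketch of the behaviour I'm after (assuming duplicated() and drop_duplicates() take a subset column, which I haven't got working; file names are placeholders):
import pandas as pd

# the column to dedupe by needs to be easy to swap out
dedupe_col = 'Order ID'

df = pd.read_csv('duplicate example.csv')

# rows whose dedupe_col value already appeared earlier in the file
# (for the example data this is row 5: Kiwi / 3333)
dupes = df[df.duplicated(subset=[dedupe_col], keep='first')]

# everything else, keeping only the first occurrence of each value
deduped = df.drop_duplicates(subset=[dedupe_col], keep='first')

dupes.to_csv('duplicates.csv', index=False)
deduped.to_csv('deduped.csv', index=False)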
Upvotes: 0
Views: 860
Reputation: 144
The error is in the second line of your code (you should use pd.DataFrame) if you want to use the drop_duplicates method.
df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicateexample.csv')
# Create dataframe with duplicates
raw_data = {'ID': [1, 2, 3, 4, 5],
            'Fruit': ['apple', 'Banana', 'Orange', 'Mango', 'Kiwi'],
            'Order ID': [1111, 2222, 3333, 4444, 5555],
            'Quantity': [11, 22, 33, 44, 55],
            'Price': [2, 3, 5, 7, 5]}
new_df = pd.DataFrame(raw_data, columns = ['ID','Fruit','Order ID','Quantity','Price']).drop_duplicates()
new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)
Hope it helps.
Upvotes: 0
Reputation: 1409
You can achieve this by creating a new dataframe with value_counts(), merging, and then filtering.
# value_counts returns a Series, to_frame() makes it into DataFrame
df_counts = df['OrderID'].value_counts().to_frame()
# rename the column
df_counts.columns = ['order_counts']
# merge the original on column "OrderID" with the counts on its index
df_merged = pd.merge(df, df_counts, left_on='OrderID', right_index=True)
# the duplicates are then just the rows whose count is higher than 1
df_filtered = df_merged[df_merged['order_counts']>1]
# if you want everything else that isn't a duplicate
df_not_duplicates = df_merged[df_merged['order_counts']==1]
edit: drop_duplicates() keeps only unique values; when it finds duplicates it removes all but one. You choose which one is kept with the "keep" argument, which can be 'first' or 'last' (or False to drop every occurrence).
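A quick sketch of the difference, on made-up data mirroring the question:
import pandas as pd

df = pd.DataFrame({'OrderID': [1111, 3333, 3333],
                   'Fruit': ['apple', 'orange', 'Kiwi']})
# keep='first' retains the orange row; keep='last' retains the Kiwi row
print(df.drop_duplicates(subset='OrderID', keep='first'))
print(df.drop_duplicates(subset='OrderID', keep='last'))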
edit2: From your comment, you want to export the result to CSV. Remember that the approach above produces 2 separate DataFrames:
a) Only items whose OrderID was never duplicated (df_not_duplicates)
b) Only items whose OrderID had duplicates, with every occurrence still included (df_filtered)
# Option 1: save all OrderIDs that had duplicates, with the duplicates still included:
df_filtered.to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")
# Option 2: all OrderIDs that had duplicate values, but only 1 line per OrderID
df_filtered.drop_duplicates(subset="OrderID", keep='last').to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")
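And since you mentioned needing to change the dedupe column, the same filtering works with the column name in a variable (col is just a name I picked for illustration):
col = 'OrderID'  # swap this for any other column name
counts = df[col].value_counts()
# keep only the rows whose value in col occurs more than once
df_dups = df[df[col].isin(counts[counts > 1].index)]
df_dups.to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")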
Upvotes: 1