shaneo

Reputation: 65

How to drop duplicates in a CSV with the pandas library in Python?

I've been looking around and have tried some examples, but I can't get it to work the way I want to.

I want to dedupe by 'Order ID' and extract the duplicates to a separate CSV. The main thing is that I need to be able to change the column I dedupe by; in this case it's 'Order ID'.

Example Data set:

ID    Fruit   Order ID    Quantity    Price
1     apple      1111        11       £2.00
2     banana     2222        22       £3.00
3     orange     3333        33       £5.00
4     mango      4444        44       £7.00
5     Kiwi       3333        55       £5.00

Output:

ID    Fruit   Order ID    Quantity    Price
5     Kiwi       3333        55       £5.00

I've tried this:

import pandas as pd

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate example.csv')

new_df = df[['ID','Fruit','Order ID','Quantity','Price']].drop_duplicates()

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)

The issue I have is that it doesn't remove any duplicates.

Upvotes: 0

Views: 860

Answers (2)

techPirate99

Reputation: 144

The error is in the second line of your code: you should use pd.DataFrame if you want to use the drop_duplicates method.

import pandas as pd

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicateexample.csv')

# Create a DataFrame with a duplicated Order ID (3333 appears twice, as in the question)
raw_data = {'ID': [1, 2, 3, 4, 5],
            'Fruit': ['apple', 'banana', 'orange', 'mango', 'Kiwi'],
            'Order ID': [1111, 2222, 3333, 4444, 3333],
            'Quantity': [11, 22, 33, 44, 55],
            'Price': [2, 3, 5, 7, 5]}

# dedupe on 'Order ID', keeping the first occurrence of each value
new_df = pd.DataFrame(raw_data, columns=['ID', 'Fruit', 'Order ID', 'Quantity', 'Price']).drop_duplicates(subset='Order ID', keep='first')

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)
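Note that pd.read_csv already returns a DataFrame, so the same call also works directly on the frame loaded from the CSV. A minimal sketch (paths shortened here for brevity); the subset argument is what makes the dedupe column configurable, which is what the question asks for:

import pandas as pd

df = pd.read_csv('duplicate example.csv')

# one row per 'Order ID'; change subset to dedupe on a different column
new_df = df.drop_duplicates(subset='Order ID', keep='first')

new_df.to_csv('duplicate test.csv', index=False)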

Hope it helps.

Upvotes: 0

mrbTT

Reputation: 1409

You can achieve this by creating a new DataFrame with value_counts(), merging, and then filtering.

# value_counts returns a Series; to_frame() makes it a DataFrame
df_counts = df['Order ID'].value_counts().to_frame()
# rename the column
df_counts.columns = ['order_counts']

# merge the original on the 'Order ID' column and the counts on their index
df_merged = pd.merge(df, df_counts, left_on='Order ID', right_index=True)

# the duplicates are simply the rows whose count is higher than 1
df_filtered = df_merged[df_merged['order_counts'] > 1]

# if you want everything else that isn't a duplicate
df_not_duplicates = df_merged[df_merged['order_counts'] == 1]
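As a quick check, here are the steps above run against the question's sample data (built in memory so the example is self-contained); df_filtered ends up holding exactly the two 3333 rows:

import pandas as pd

# the question's sample data
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Fruit': ['apple', 'banana', 'orange', 'mango', 'Kiwi'],
    'Order ID': [1111, 2222, 3333, 4444, 3333],
    'Quantity': [11, 22, 33, 44, 55],
    'Price': ['£2.00', '£3.00', '£5.00', '£7.00', '£5.00'],
})

df_counts = df['Order ID'].value_counts().to_frame()
df_counts.columns = ['order_counts']
df_merged = pd.merge(df, df_counts, left_on='Order ID', right_index=True)

# prints the rows with ID 3 and ID 5, which share Order ID 3333
print(df_merged[df_merged['order_counts'] > 1])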

edit: drop_duplicates() keeps only unique values; if it finds duplicates, it removes all but one. Which one to keep is set by the keep argument, which can be 'first' or 'last' (or False, which drops every duplicated row).
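To illustrate, a short sketch of the keep argument on the question's 'Order ID' column (assuming df is the question's DataFrame):

# keep the first occurrence of each 'Order ID' (drops the Kiwi row, ID 5)
first_kept = df.drop_duplicates(subset='Order ID', keep='first')

# keep the last occurrence instead (drops the orange row, ID 3)
last_kept = df.drop_duplicates(subset='Order ID', keep='last')

# keep=False drops every row whose 'Order ID' appears more than once
no_dupes_at_all = df.drop_duplicates(subset='Order ID', keep=False)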

edit2: From your comment, you want to export the result to CSV. Remember that the way I did it above separates the data into two DataFrames:

a) All items that never had a duplicate (df_not_duplicates)

b) Only the items that are duplicated, with every occurrence still present (df_filtered)

# Type 1: save all Order IDs that had duplicates, still including every occurrence:
df_filtered.to_csv("path_to_my_csv/filename.csv", sep=",", encoding="utf-8")

# Type 2: all Order IDs that had duplicates, but only one line per Order ID:
df_filtered.drop_duplicates(subset="Order ID", keep='last').to_csv("path_to_my_csv/filename.csv", sep=",", encoding="utf-8")
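One note: as in the question's original code, adding index=False stops pandas from writing the DataFrame index as an extra column:

df_filtered.drop_duplicates(subset="Order ID", keep='last').to_csv("path_to_my_csv/filename.csv", sep=",", encoding="utf-8", index=False)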

Upvotes: 1
