pythonscrub
pythonscrub

Reputation: 75

How can I delete duplicates in two different csv files for a specific column

I have two csv files and both have a common column named 'Name'. How can I compare the two and write a new file without the duplicate values?

Code I have so far:

import pandas as pd
import csv

df1 = pd.read_csv('Stale1.csv')
df2 = pd.read_csv('Stale2.csv')

list1=df1['Name'].tolist()
list2=df2['Name'].tolist()

Example of the columns:

Name
aaa
bbb
ccc
ddd
Name
aaa
vvv
hhh

I would like to write a new file that does not contain 'aaa' as it is the duplicate. Desired output:

Name
bbb
ccc
ddd
vvv
hhh

Upvotes: 0

Views: 45

Answers (1)

kuzand
kuzand

Reputation: 9806

You can concatenate the two dataframes and then use the drop_duplicates method:

df = pd.concat([df1, df2]).drop_duplicates(subset='Name', keep=False, ignore_index=True)

where subset='Name' tells to consider only the column 'Name' for identifying duplicates and keep=False tells to drop all the duplicates.

To get the final result use:

df['Name'].tolist()

Upvotes: 1

Related Questions