user3101
user3101

Reputation: 3

How to Compare two columns in two different csv's (old and new) and update a third column if a value exists in old csv with pandas

I have two CSVs. The first one contains a list of all previous customers with IDs assigned to them. And a new csv in which I'm auto generating IDs with following code:

df['ID'] = pd.to_datetime('today').strftime('%m%d%y') + df.index.map(str)

OLD.csv

ID            FirstName   LastName
1             John        Smith
2             Jack        Ma
3             John        Wick
....          ....        ....
210906ABC3    Jon         Snow
210907ABC0    Peter       Parker
210907ABC1    Tony        Stark

NEW.csv with current script

ID            FirstName   LastName
210908ABC0    Black       Widow
210908ABC1    Steve       Rogers
210908ABC2    John        Wick
210908ABC3    John        Rambo
210908ABC4    Tony        Stark

I need to compare the FirstName, LastName columns from the CSVs and if the customer already exists in OLD.csv, instead of generating a new ID, it should take the ID value from OLD.csv

Expected output for NEW.csv

ID            FirstName   LastName
210908ABC1    Black       Widow
210908ABC2    Steve       Rogers
3             John        Wick
210908ABC3    John        Rambo
1             John        Smith

In the future I might need to compare three or four columns and only assign the IDs if all the columns match. FirstName and LastName and (CellPhone or Address) and (Location or SSN)

Upvotes: 0

Views: 537

Answers (1)

SR3142
SR3142

Reputation: 610

if you have both files in two dataframes df1 and df2 you can merge the two then update the ID in the first file and print only the columns from the first file, this will only work for files up to a few thousand rows as the merge is quite slow

df2.columns = [x + "_2" for x in df2.columns]  # to avoid auto renaming by pd
result = pd.merge(df1, df2, how='left', left_on = key_cols1, right_on = key_cols2)

# update the ID column
result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)

print(result.to_csv(index=False,columns=df1.columns))

Edit: this is a simple working example, file1 (df1) is the file you want to update and file2 is the file that contains the IDs you want to copy over to file1

import pandas as pd, numpy as np, argparse, os

parser = argparse.ArgumentParser(description='update id in file1 with id from file2.')
parser.add_argument('-k', help='key column both file', required=True)
parser.add_argument('file1', help='file1 to be updated')
parser.add_argument('file2', help='file2 contains updates for file1')

args = parser.parse_args()
if not os.path.isfile(args.file1): raise ValueError('File does not exist: ' + args.file1)
if not os.path.isfile(args.file2): raise ValueError('File does not exist: ' + args.file2)

df1 = pd.read_csv(args.file1,dtype=str,header=0)
df2 = pd.read_csv(args.file2,dtype=str,header=0)
df2.columns = [x + "_2" for x in df2.columns]

key_col1 = [list(df1.columns)[int(x)] for x in args.k.split(",")]
key_col2 = [list(df2.columns)[int(x)] for x in args.k.split(",")]
result = pd.merge(df1, df2, how='left', left_on = key_col1, right_on = key_col2)

result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)

print(result.to_csv(index=False,columns=df1.columns))

use as follows:

$ python merge.py -k 1,2 file1.csv file2.csv

ID,FirstName,LastName
210908ABC0,Black,Widow
210908ABC1,Steve,Rogers
3,John,Wick
210908ABC3,John,Rambo
210907ABC1,Tony,Stark

make sure that the key is unique per row otherwise you can get multiple joins generating extra rows in the output file.

Upvotes: 1

Related Questions