Reputation: 3
I have two CSVs. The first one contains a list of all previous customers with IDs assigned to them. And a new csv in which I'm auto generating IDs with following code:
df['ID'] = pd.to_datetime('today').strftime('%m%d%y') + df.index.map(str)
OLD.csv
ID FirstName LastName
1 John Smith
2 Jack Ma
3 John Wick
.... .... ....
210906ABC3 Jon Snow
210907ABC0 Peter Parker
210907ABC1 Tony Stark
NEW.csv with current script
ID FirstName LastName
210908ABC0 Black Widow
210908ABC1 Steve Rogers
210908ABC2 John Wick
210908ABC3 John Rambo
210908ABC4 Tony Stark
I need to compare the FirstName, LastName columns from the CSVs and if the customer already exists in OLD.csv, instead of generating a new ID, it should take the ID value from OLD.csv
Expected output for NEW.csv
ID FirstName LastName
210908ABC1 Black Widow
210908ABC2 Steve Rogers
3 John Wick
210908ABC3 John Rambo
1 John Smith
In the future I might need to compare three or four columns and only assign the IDs if all the columns match. FirstName and LastName and (CellPhone or Address) and (Location or SSN)
Upvotes: 0
Views: 537
Reputation: 610
if you have both files in two dataframes df1 and df2 you can merge the two then update the ID in the first file and print only the columns from the first file, this will only work for files up to a few thousand rows as the merge is quite slow
df2.columns = [x + "_2" for x in df2.columns] # to avoid auto renaming by pd
result = pd.merge(df1, df2, how='left', left_on = key_cols1, right_on = key_cols2)
# update the ID column
result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)
print(result.to_csv(index=False,columns=df1.columns))
Edit: this is a simple working example, file1 (df1) is the file you want to update and file2 is the file that contains the IDs you want to copy over to file1
import pandas as pd, numpy as np, argparse, os
parser = argparse.ArgumentParser(description='update id in file1 with id from file2.')
parser.add_argument('-k', help='key column both file', required=True)
parser.add_argument('file1', help='file1 to be updated')
parser.add_argument('file2', help='file2 contains updates for file1')
args = parser.parse_args()
if not os.path.isfile(args.file1): raise ValueError('File does not exist: ' + args.file1)
if not os.path.isfile(args.file2): raise ValueError('File does not exist: ' + args.file2)
df1 = pd.read_csv(args.file1,dtype=str,header=0)
df2 = pd.read_csv(args.file2,dtype=str,header=0)
df2.columns = [x + "_2" for x in df2.columns]
key_col1 = [list(df1.columns)[int(x)] for x in args.k.split(",")]
key_col2 = [list(df2.columns)[int(x)] for x in args.k.split(",")]
result = pd.merge(df1, df2, how='left', left_on = key_col1, right_on = key_col2)
result.ID = np.where(result.ID_2.isnull(), result.ID, result.ID_2)
print(result.to_csv(index=False,columns=df1.columns))
use as follows:
$ python merge.py -k 1,2 file1.csv file2.csv
ID,FirstName,LastName
210908ABC0,Black,Widow
210908ABC1,Steve,Rogers
3,John,Wick
210908ABC3,John,Rambo
210907ABC1,Tony,Stark
make sure that the key is unique per row otherwise you can get multiple joins generating extra rows in the output file.
Upvotes: 1