Mastrepolo
Mastrepolo

Reputation: 21

Compare multiple columns in two separate csv files and when they match parse for a matching string value

I need to compare multiple columns in two separate csv files and when the first 4 columns all match between the two csv files, make a new csv file with the matching four columns and then add the 5th and 7th column of the second csv file to it as well as parse the first csv file column 8 to any of the columns in second csv file. if they match then take the subsequent columns value. I know this sounds confusing but below shows what I am trying to do

First.csv

Address ID X Y col5 col6 col7 col8 col9 
Bob     16 1 2  x    x    x    ABC  #1
Jon     6  5 7  x    x    x    ABC  #2
Rem     16 4 9  x    x    x    ABC  #3
Bob     16 2 3  x    x    x    ABC  #4

Second.csv

Address ID X Y col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15 
Bob     16 1 2  x    x    x    def  #    ABC    #5   ghi    #    jkl   # 
Bob     16 1 3  x    x    x    def  #    ABC    #    ghi    #    jkl   #
Jon     6  5 7  x    x    x    def  #    ghi    #    ABC    #6   jkl   #
Jon     6  5 8  x    x    x    def  #    ghi    #    ABC    #    jkl   #
Rem     16 4 9  x    x    x    def  #    ghi    #    jkl    #    ABC   #7
Rem     16 4 8  x    x    x    def  #    ghi    #    jkl    #    ABC   #
Bob     16 2 3  x    x    x    ABC  #8   def    #    ghi    #    jkl   #
Bob     16 2 6  x    x    x    ABC  #    def    #    ghi    #    jkl   # 

output.csv

Address ID X Y 1 2 3   4   5  6
Bob     16 1 2 x x x  ABC #1 #5
Jon     6  5 7 x x x  ABC #2 #6
Rem     16 4 9 x x x  ABC #3 #7
Bob     16 2 3 x x x  ABC #4 #8

I have been trying to use conditional statements like:

pathFirst = "C:\First.csv"
pathFinal  = "C:\Second.csv"

dataFirst = pd.read_csv(pathFirst)
dataFinal = pd.read_csv(pathFinal)

Output = pd.DataFrame(columns=['status'], index=dataFirst.index)
Output['status'] = (dataFirst['Address'] == dataFinal['address'] & dataFirst['ID'] == dataFinal['ID']).replace([True, False], ['Matching', 'Not Matching'])

Output.to_csv('C:\output.csv')

Just to see if there is a line that is actually comparing correctly but seem like there is just a couple. I know I am doing this wrong and I hope this is explained well enough so I really appreciate any help. Thank you.

EDIT: I only want the first four columns to match. Address, ID, X, Y after that I want to match column 4 string value in First.csv file with the same string value in columns 4,6 8 or 10 in Second.csv. If they match then take the #value from the next column. So if in Second.csv row 1 column 6 matches the string from the First.csv file then I want Second.csv row 1 column 7s #value.

Upvotes: 1

Views: 853

Answers (1)

deadshot
deadshot

Reputation: 9061

Try this

df1 = pd.read_csv('first.csv')
df2 = pd.read_csv('second.csv')

res = df1.merge(df2.iloc[:, :4], on=['Address', 'ID', 'X', 'Y'])

val = [np.where(df2[str(n)].isin(df1['4']), df2[str(n + 1)], np.nan) for n in range(4, 11, 2)]

res['6'] = pd.DataFrame(val).replace('#', np.nan).T.dropna(how='all').bfill(axis=1)[0].to_list()

print(res)

Output:

  Address  ID  X  Y  1  2  3    4   5   6
0     Bob  16  1  2  x  x  x  ABC  #1  #5
1     Jon   6  5  7  x  x  x  ABC  #2  #6
2     Rem  16  4  9  x  x  x  ABC  #3  #7
3     Bob  16  2  3  x  x  x  ABC  #4  #8

Upvotes: 1

Related Questions