Reputation:
I have two csv files both consist of two columns.
The first one has the product id, and the second has the serial number.
I need to lookup, all serial numbers from the first csv, and find matches, on the second csv. The result report, will have the serial number that matched, and the corresponding product ids from each csv, in a separate column i trued to modify the below code, no luck.
How would you approach this?
import pandas as pd
A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it.
B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here
print(A-B) #set A - set B gives back everything thats only in A.
print(B-A) # same here, other way around.
Upvotes: 9
Views: 39209
Reputation: 9721
You can convert df into Sets , that will ignore the index while comparing the data, then use set symmetric_difference
ds1 = set([ tuple(values) for values in df1.values.tolist()])
ds2 = set([ tuple(values) for values in df2.values.tolist()])
ds1.symmetric_difference(ds2)
print df1 ,'\n\n'
print df2,'\n\n'
print pd.DataFrame(list(ds1.difference(ds2))),'\n\n'
print pd.DataFrame(list(ds2.difference(ds1))),'\n\n'
df1
id Name score isEnrolled Comment
0 111 Jack 2.17 True He was late to class
1 112 Nick 1.11 False Graduated
2 113 Zoe 4.12 True NaN
df2
id Name score isEnrolled Comment
0 111 Jack 2.17 True He was late to class
1 112 Nick 1.21 False Graduated
2 113 Zoe 4.12 False On vacation
Output
0 1 2 3 4
0 113 Zoe 4.12 True NaN
1 112 Nick 1.11 False Graduated
0 1 2 3 4
0 113 Zoe 4.12 False On vacation
1 112 Nick 1.21 False Graduated
Upvotes: 1
Reputation: 210832
Try this:
A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
# A - B
pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'")
# B - A
pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")
Upvotes: 4
Reputation: 862511
I think you need merge
:
A = pd.DataFrame({'product id': [1455,5452,3775],
'serial number':[44,55,66]})
print (A)
B = pd.DataFrame({'product id': [7000,2000,1000],
'serial number':[44,55,77]})
print (B)
print (pd.merge(A, B, on='serial number'))
product id_x serial number product id_y
0 1455 44 7000
1 5452 55 2000
Upvotes: 9