user7609771
user7609771

Reputation:

Compare two csv files with python pandas

I have two csv files both consist of two columns.

The first one has the product id, and the second has the serial number.

I need to lookup, all serial numbers from the first csv, and find matches, on the second csv. The result report, will have the serial number that matched, and the corresponding product ids from each csv, in a separate column i trued to modify the below code, no luck.

How would you approach this?

import pandas as pd
    A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it.
    B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here
    print(A-B) #set A - set B gives back everything thats only in A.
    print(B-A) # same here, other way around.

Upvotes: 9

Views: 39209

Answers (3)

Shijo
Shijo

Reputation: 9721

You can convert df into Sets , that will ignore the index while comparing the data, then use set symmetric_difference

ds1 = set([ tuple(values) for values in df1.values.tolist()])
ds2 = set([ tuple(values) for values in df2.values.tolist()])

ds1.symmetric_difference(ds2)
print df1 ,'\n\n'
print df2,'\n\n'

print pd.DataFrame(list(ds1.difference(ds2))),'\n\n'
print pd.DataFrame(list(ds2.difference(ds1))),'\n\n'

df1

id  Name  score isEnrolled               Comment
0  111  Jack   2.17       True  He was late to class
1  112  Nick   1.11      False             Graduated
2  113   Zoe   4.12       True                   NaN 

df2

    id  Name  score isEnrolled               Comment
0  111  Jack   2.17       True  He was late to class
1  112  Nick   1.21      False             Graduated
2  113   Zoe   4.12      False           On vacation 

Output

     0     1     2      3          4
0  113   Zoe  4.12   True        NaN
1  112  Nick  1.11  False  Graduated 


     0     1     2      3            4
0  113   Zoe  4.12  False  On vacation
1  112  Nick  1.21  False    Graduated 

Upvotes: 1

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

Try this:

A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates()
# A - B
pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'")
# B - A
pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")

Upvotes: 4

jezrael
jezrael

Reputation: 862511

I think you need merge:

A = pd.DataFrame({'product id':   [1455,5452,3775],
                    'serial number':[44,55,66]})

print (A)

B = pd.DataFrame({'product id':   [7000,2000,1000],
                    'serial number':[44,55,77]})

print (B)

print (pd.merge(A, B, on='serial number'))
   product id_x  serial number  product id_y
0          1455             44          7000
1          5452             55          2000

Upvotes: 9

Related Questions