Zach

Reputation: 650

Comparing content in two csv files

I have two CSV files. Book1.csv has more data than similarities.csv, so I want to pull out the rows in Book1.csv that do not occur in similarities.csv. Here's what I have so far:

    with open('Book1.csv', 'rb') as csvMasterForDiff:
        with open('similarities.csv', 'rb') as csvSlaveForDiff:
            masterReaderDiff = csv.reader(csvMasterForDiff)
            slaveReaderDiff = csv.reader(csvSlaveForDiff)        

            testNotInCount = 0
            testInCount = 0
            for row in masterReaderDiff:
                if row not in slaveReaderDiff:
                    testNotInCount = testNotInCount + 1
                else :
                    testInCount = testInCount + 1


    print('Not in file: ' + str(testNotInCount))
    print('Exists in file: ' + str(testInCount))

However, the results are

Not in file: 2093
Exists in file: 0

I know this is incorrect: at least the first 16 entries in Book1.csv do not exist in similarities.csv, but not all of them are missing. What am I doing wrong?

Upvotes: 1

Views: 90

Answers (2)

Pankaj Singhal

Reputation: 16053

After converting the rows into sets, you can use set operations to do most of the work without writing much code.

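    # csv.reader yields each row as a list, which is unhashable,
    # so convert rows to tuples before building the sets.
    slave_rows = set(map(tuple, slaveReaderDiff))
    master_rows = set(map(tuple, masterReaderDiff))

    master_minus_slave_rows = master_rows - slave_rows
    common_rows = master_rows & slave_rows

    print('Not in file: ' + str(len(master_minus_slave_rows)))
    print('Exists in file: ' + str(len(common_rows)))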

Here are various set operations that you can do.
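
As a hedged, self-contained sketch of the set approach: the function name `compare_rows` and the sample data are hypothetical, and in-memory lists of lines stand in for the two open files (csv.reader accepts any iterable of strings). Rows are converted to tuples because lists cannot be stored in a set.

```python
import csv

def compare_rows(master_lines, slave_lines):
    """Return (rows only in master, rows in both) as sets of tuples."""
    # csv.reader yields lists; tuples are hashable, so they can go in a set.
    master_rows = set(map(tuple, csv.reader(master_lines)))
    slave_rows = set(map(tuple, csv.reader(slave_lines)))
    return master_rows - slave_rows, master_rows & slave_rows

# Hypothetical sample data standing in for Book1.csv and similarities.csv.
master = ["a,1", "b,2", "c,3"]
slave = ["b,2", "c,3"]

only_in_master, common = compare_rows(master, slave)
print('Not in file: ' + str(len(only_in_master)))
print('Exists in file: ' + str(len(common)))
```

Keep in mind that sets discard duplicate rows and row order, so the counts refer to distinct rows rather than lines in the file.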

Upvotes: 0

Eugene Yarmash

Reputation: 149736

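A csv.reader object is an iterator, which means you can only iterate through it once. An `in` test on an iterator consumes it until a match is found, so once one lookup fails the reader is exhausted and every later lookup fails too. You should be using lists/sets for containment checking, e.g.: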

    # Rows are lists (unhashable), so store them as tuples in the set.
    slave_rows = set(map(tuple, slaveReaderDiff))

    for row in masterReaderDiff:
        if tuple(row) not in slave_rows:
            testNotInCount += 1
        else:
            testInCount += 1
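
To make the one-pass behaviour concrete, here is a minimal sketch under stated assumptions: the sample lines and the helper name `rows_missing_from` are hypothetical, and lists of strings stand in for the open files. It first shows a failed `in` test exhausting the reader, then collects the missing rows in their original order.

```python
import csv

slave_lines = ["b,2", "c,3"]  # hypothetical stand-in for similarities.csv

reader = csv.reader(slave_lines)
first_check = ["x", "9"] in reader   # no match: scans and exhausts the reader
second_check = ["b", "2"] in reader  # nothing left to scan, so False as well

def rows_missing_from(master_lines, slave_lines):
    """Return the master rows absent from slave, in their original order."""
    # Read the slave rows once into a set of tuples for fast lookups.
    slave_rows = set(map(tuple, csv.reader(slave_lines)))
    return [row for row in csv.reader(master_lines)
            if tuple(row) not in slave_rows]

missing = rows_missing_from(["a,1", "b,2", "c,3"], slave_lines)
print(first_check, second_check)
print(missing)
```

Unlike a set difference, the list comprehension keeps the master file's row order and any duplicate rows, which may matter if the rows are to be written back out.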

Upvotes: 1
