Reputation: 650
So I have two CSV files. Book1.csv has more data than similarities.csv, so I want to pull out the rows in Book1.csv that do not occur in similarities.csv. Here's what I have so far:
with open('Book1.csv', 'rb') as csvMasterForDiff:
    with open('similarities.csv', 'rb') as csvSlaveForDiff:
        masterReaderDiff = csv.reader(csvMasterForDiff)
        slaveReaderDiff = csv.reader(csvSlaveForDiff)
        testNotInCount = 0
        testInCount = 0
        for row in masterReaderDiff:
            if row not in slaveReaderDiff:
                testNotInCount = testNotInCount + 1
            else:
                testInCount = testInCount + 1
print('Not in file: ' + str(testNotInCount))
print('Exists in file: ' + str(testInCount))
However, the results are
Not in file: 2093
Exists in file: 0
I know this is incorrect because only some of the entries in Book1.csv (at least the first 16) are missing from similarities.csv, not all of them. What am I doing wrong?
Upvotes: 1
Views: 90
Reputation: 16053
After converting the rows of both readers into sets, you can use a lot of helpful set operations without writing much code.
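# csv.reader yields lists, which are unhashable, so convert each row to a tuple
slave_rows = set(map(tuple, slaveReaderDiff))
master_rows = set(map(tuple, masterReaderDiff))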
master_minus_slave_rows = master_rows - slave_rows
common_rows = master_rows & slave_rows
print('Not in file: '+ str(len(master_minus_slave_rows)))
print('Exists in file: '+ str(len(common_rows)))
Sets support various other operations as well, such as union (|) and symmetric difference (^).
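For anyone who wants to try the set approach without the real files, here is a minimal, self-contained sketch. The in-memory strings `master_text` and `slave_text` and the helper `read_rows` are made up for illustration; they stand in for Book1.csv and similarities.csv. Each row is turned into a tuple, because the lists that `csv.reader` yields cannot be stored in a set.

```python
import csv
import io

# Hypothetical stand-ins for Book1.csv and similarities.csv
master_text = "a,1\nb,2\nc,3\nd,4\n"
slave_text = "b,2\nd,4\n"


def read_rows(text):
    """Return the CSV rows as a set of tuples (lists are unhashable)."""
    return {tuple(row) for row in csv.reader(io.StringIO(text))}


master_rows = read_rows(master_text)
slave_rows = read_rows(slave_text)

# Rows only in the master file, and rows present in both files
master_minus_slave_rows = master_rows - slave_rows
common_rows = master_rows & slave_rows

print('Not in file: ' + str(len(master_minus_slave_rows)))  # Not in file: 2
print('Exists in file: ' + str(len(common_rows)))           # Exists in file: 2
```

Keep in mind that sets drop duplicate rows and do not preserve file order, so the counts can differ from a row-by-row loop if Book1.csv contains repeated rows.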
Upvotes: 0
Reputation: 149736
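A csv.reader object is an iterator, which means you can only iterate through it once. The first `row not in slaveReaderDiff` check reads the reader all the way to the end when it finds no match, so every later check searches an already exhausted reader and reports the row as missing. You should be using lists/sets for containment checking, e.g.: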
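# Read the rows once into a set; rows become tuples because lists are unhashable
slave_rows = set(map(tuple, slaveReaderDiff))
for row in masterReaderDiff:
    if tuple(row) not in slave_rows:
        testNotInCount += 1
    else:
        testInCount += 1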
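To see the one-pass behaviour in isolation, here is a small sketch that uses in-memory data instead of the real files. The sample rows and the names `slave_reader`, `first_check`, `second_check`, and `missing` are made up for illustration. It shows that an `in` test on a reader consumes it, and that materializing the rows once avoids the problem.

```python
import csv
import io

# Hypothetical two-row stand-in for similarities.csv
slave_reader = csv.reader(io.StringIO("b,2\nd,4\n"))

# `in` on an iterator walks it until a match is found (or it runs out)
first_check = ['d', '4'] in slave_reader   # True, but both rows are now consumed
second_check = ['b', '2'] in slave_reader  # False: the reader is exhausted

# Reading the rows once into a set of tuples makes every lookup reliable
slave_rows = {tuple(row) for row in csv.reader(io.StringIO("b,2\nd,4\n"))}
master_reader = csv.reader(io.StringIO("a,1\nb,2\nc,3\nd,4\n"))
missing = [row for row in master_reader if tuple(row) not in slave_rows]

print(first_check, second_check)  # True False
print(missing)                    # [['a', '1'], ['c', '3']]
```

Unlike a pure set difference, the list comprehension keeps the master rows in file order and keeps duplicates.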
Upvotes: 1