Reputation: 85
I currently have a version of a script that compares two csv files by reading both into a list/set one after the other. However, the csvs are currently too large for memory, so I would like to iterate line by line and print out lines which are different.
i can't do two loops of csvreaders because then the inner one will read the whole file while the outer loop will be on the first line
i'd rather do this than diff so i can print additional info out when lines don't match to figure out why
Edit: so far i did something like this but the csvs are too large to load into a list all at once
def readFile(filename,columns):
mylist=[]
with open(filename,'rb') as f:
reader = csv.reader(f)
for line in reader:
mylist.append(tuple(line[i] for i in columns))
return mylist
mylist1=readFile(filename1,columns)
mylist2=readFile(filename2,columns)
diff1=diff(mylist1,mylist2)
diff2=diff(mylist2,mylist2)
Upvotes: 0
Views: 14456
Reputation: 48
the following code iterates over each line and gives the line numbers and the positions and items of the differing entries as output. This solution (python3) is convenient if the lines have many items, assuming a comma is used as delimiter.
def compare_csv(file_1, file_2):
with open(file_1, 'r') as csv_1:
with open(file_2, 'r') as csv_2:
reader1 = csv.reader(csv_1, delimiter=',')
reader2 = csv.reader(csv_2, delimiter=',')
try:
i_line = 0
while True:
row_1 = next(reader1)
row_2 = next(reader2)
i_line += 1
if row_1 != row_2:
for k, item in enumerate(row_1):
if item != row_2[k]:
print(f"difference Line {i_line}: ")
print(f"item {k}: {row_1[k]}")
print(f"item {k}: {row_2[k]}")
except StopIteration:
print("line numbers differ!")
finally:
print(f"lines parsed = {i_line}")
Upvotes: 0
Reputation: 714
You could try this code. Gives you a set of all the matching values.
import csv
result_path = 'result_check.csv'
result_file = open(result_path,'r')
result_reader = csv.reader(result_file)
f1 = {}
for rows in result_reader:
f1[rows[0]] = rows[1]
forest_path = 'pandababy3.csv'
forest_file = open(forest_path, 'r')
forest_reader = csv.reader(forest_file)
f2 = {}
for rows in forest_reader:
f2[rows[0]] = rows[1]
x = len(set(f1.items()).intersection(set(f2.items())))
print(x)
Edit: I am using rows[1], because my file had column headers. Anyone using this code to compare files, without column headers, please use rows[0].
Upvotes: 1
Reputation: 399
You can easily do this using 'awk' if you are using a Linux machine.
paste <(awk -F, '{ print $1;next } file1.csv ) <(awk -F, '{print $1;next } file2.csv) | awk '{ if ($1==$2) print "match" ; else print "mismatch" }'
$1 means first field, $2 second and so on. $0 represents entire line.
Explanation:
Here the field 1 in every line of file1.csv is printed by first awk command ; and field 1 in every line of file2.csv is printed by second awk command. The outputs of these to commands are printed side by side using paste command. Now this is piped to another awk command where $1 is the field1 in file1.csv and $2 is first field in file2.csv. Then these two are compared and the result is printed.
If you wish to print line number you can use "print NR", where NR is the number of record.
Similarly you can compare all the fields using $2 $3 and so on
You can also compare any number of lines using NR<10 to print field 1 of ten lines as in:
awk -F, 'NR<=10{print $1}' file1.csv
in the above command.
Hope that helps.
Upvotes: 0
Reputation: 81936
itertools.izip()
solves this problem elegantly:
import csv
import itertools
reader1 = csv.reader(filename1)
reader2 = csv.reader(filename2)
for lhs, rhs in itertools.izip(reader1, reader2):
if lhs != rhs:
print "difference:", lhs, rhs
Upvotes: 1
Reputation: 50990
Assuming the files have the same number of lines and you're looking only for intra-line differences (that is, comparing line 1 in the first file with line 1 in the second file, and so on) something like this should work:
with open(filename1, 'rb') as f1, open(filename2, 'rb') as f2
rdr1 = CsvReader(f1)
rdr2 = CsvReader(f2)
for file1_line in rdr1:
file2_line = rdr2.next()
# Perform your comparison between file1_line and file2_line here
# and print differences, or accumulate only the differences in a
# results list.
Upvotes: 3
Reputation: 859
i think of an approach to read files line-by-line and apply md5sum/SHA1 sum and later compare these values. if space or any other character is insignificant remove them before applying checksum
Upvotes: 0