user3752900
user3752900

Reputation: 85

Python Compare two CSV Files line by line

I currently have a version of a script that compares two csv files by reading both into a list/set one after the other. However, the csvs are currently too large for memory, so I would like to iterate line by line and print out lines which are different.

i can't do two loops of csvreaders because then the inner one will read the whole file while the outer loop will be on the first line

i'd rather do this than diff so i can print additional info out when lines don't match to figure out why

Edit: so far i did something like this but the csvs are too large to load into a list all at once

def readFile(filename,columns):
mylist=[]
    with open(filename,'rb') as f:
                reader = csv.reader(f)
                for line in reader:
                    mylist.append(tuple(line[i] for i in columns))
    return mylist

mylist1=readFile(filename1,columns)
mylist2=readFile(filename2,columns)
diff1=diff(mylist1,mylist2)
diff2=diff(mylist2,mylist2)

Upvotes: 0

Views: 14456

Answers (6)

Uwe Schweinsberg
Uwe Schweinsberg

Reputation: 48

the following code iterates over each line and gives the line numbers and the positions and items of the differing entries as output. This solution (python3) is convenient if the lines have many items, assuming a comma is used as delimiter.

def compare_csv(file_1, file_2):

with open(file_1, 'r') as csv_1:
    with open(file_2, 'r') as csv_2:
        reader1 = csv.reader(csv_1, delimiter=',')
        reader2 = csv.reader(csv_2, delimiter=',')
        try:
            i_line = 0
            while True:
                row_1 = next(reader1)
                row_2 = next(reader2)
                i_line += 1
                if row_1 != row_2:
                    for k, item in enumerate(row_1):
                        if item != row_2[k]:
                            print(f"difference Line {i_line}: ")
                            print(f"item {k}: {row_1[k]}")
                            print(f"item {k}: {row_2[k]}")
        except StopIteration:
            print("line numbers differ!")
        finally:
            print(f"lines parsed = {i_line}")

Upvotes: 0

novice_dev
novice_dev

Reputation: 714

You could try this code. Gives you a set of all the matching values.

import csv

result_path   = 'result_check.csv'
result_file = open(result_path,'r')
result_reader = csv.reader(result_file)

f1 = {}
for rows in result_reader:
    f1[rows[0]] = rows[1]


forest_path = 'pandababy3.csv'
forest_file = open(forest_path, 'r')
forest_reader = csv.reader(forest_file)

f2 = {}
for rows in forest_reader:
    f2[rows[0]] = rows[1]

x = len(set(f1.items()).intersection(set(f2.items())))

print(x)

Edit: I am using rows[1], because my file had column headers. Anyone using this code to compare files, without column headers, please use rows[0].

Upvotes: 1

Ramana Reddy
Ramana Reddy

Reputation: 399

You can easily do this using 'awk' if you are using a Linux machine.

paste <(awk -F, '{ print $1;next } file1.csv ) <(awk -F, '{print $1;next } file2.csv) | awk '{ if ($1==$2) print "match" ; else print "mismatch" }'

$1 means first field, $2 second and so on. $0 represents entire line.

Explanation:

Here the field 1 in every line of file1.csv is printed by first awk command ; and field 1 in every line of file2.csv is printed by second awk command. The outputs of these to commands are printed side by side using paste command. Now this is piped to another awk command where $1 is the field1 in file1.csv and $2 is first field in file2.csv. Then these two are compared and the result is printed.

If you wish to print line number you can use "print NR", where NR is the number of record.

Similarly you can compare all the fields using $2 $3 and so on

You can also compare any number of lines using NR<10 to print field 1 of ten lines as in:

awk -F, 'NR<=10{print $1}' file1.csv

in the above command.

Hope that helps.

Upvotes: 0

Bill Lynch
Bill Lynch

Reputation: 81936

itertools.izip() solves this problem elegantly:

import csv
import itertools

reader1 = csv.reader(filename1)
reader2 = csv.reader(filename2)

for lhs, rhs in itertools.izip(reader1, reader2):
    if lhs != rhs:
        print "difference:", lhs, rhs

Upvotes: 1

Larry Lustig
Larry Lustig

Reputation: 50990

Assuming the files have the same number of lines and you're looking only for intra-line differences (that is, comparing line 1 in the first file with line 1 in the second file, and so on) something like this should work:

with open(filename1, 'rb') as f1, open(filename2, 'rb') as f2

  rdr1 = CsvReader(f1)
  rdr2 = CsvReader(f2)

  for file1_line in rdr1:

     file2_line = rdr2.next()

     # Perform your comparison between file1_line and file2_line here
     # and print differences, or accumulate only the differences in a 
     # results list.

Upvotes: 3

Kamyar Ghasemlou
Kamyar Ghasemlou

Reputation: 859

i think of an approach to read files line-by-line and apply md5sum/SHA1 sum and later compare these values. if space or any other character is insignificant remove them before applying checksum

Upvotes: 0

Related Questions