Pythonic way to Compare two CSV files to track changes

Question

I have a Python script that generates a CSV file (data parsed from a website). Here is an example of the CSV file:

File1.csv

China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;

File2.csv

China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;

As you can see:

China;Beijing;BeiwaiOnline BFSU;;; ==> this line from File1.csv is no longer present in File2.csv.
This;Is;A;New;Line;; ==> this line from File2.csv is new (it is not present in File1.csv).

I am looking for a way to compare these two CSV files (one important thing to know is that the order of the lines doesn't matter; they can be anywhere).

What I'd like to have is a script which can tell me:

- One new line: This;Is;A;New;Line;;
- One removed line: China;Beijing;BeiwaiOnline BFSU;;;

And so on!

I've tried but without any success:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import csv

f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')

c1 = csv.reader(f1)
c2 = csv.reader(f2)

now = [row for row in c2]
past = [row for row in c1]

for row in now:
    #print row
    lol = past.index(row)
    print lol

f1.close()
f2.close()

_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Any idea of the best way to proceed ? Thank you so much in advance ;)

EDIT:

import csv

f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')

c1 = csv.reader(f1)
c2 = csv.reader(f2)

s1 = set(c1)
s2 = set(c2)

lol = s1 - s2
print type(lol)
print lol

This seems to be a good idea, but:

Traceback (most recent call last):
  File "compare.py", line 20, in <module>
    s1 = set(c1)
TypeError: unhashable type: 'list'

EDIT 2 (please ignore what is above): with your help, here is the script I'm writing:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import csv


### COMPARISON THING ###
x=0

fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
    if '.csv' in fichier:
        print('%s -----> %s' % (x,fichier))
        x=x+1

choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file

s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';')) ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE

added = [";".join(row) for row in s_now - s_past]  # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now]  # in "past" but not in "now"

c = csv.writer(open("CHANGELOG.csv", "a"),delimiter=";" )
line = ['AD']
for item_added in added:
    line.append(item_added)
    c.writerow(['AD',item_added])

line = ['RM']
for item_removed in removed:
    line.append(item_removed)
    c.writerow(line)

Two kinds of errors:

  File "programcompare.py", line 21, in <module>
    s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte

or

  File "programcompare.py", line 21, in <module>
    s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string

It was working a few minutes ago, but I've changed the CSV files to test with different data and here I am :-)

Sorry, last question !

Shawn Chin · Accepted Answer

If your data is not prohibitively large, loading it into a set (or frozenset) is an easy approach:

s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))

To get the list of entries that were added:

added = [";".join(row) for row in s_now - s_past]  # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.

Similarly, the list of entries that were removed:

removed = [";".join(row) for row in s_past - s_now]  # in "past" but not in "now"
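Putting the pieces above together, here is a minimal self-contained sketch. It is written for Python 3 (the snippets in the question are Python 2), it uses in-memory text in place of now.csv and past.csv so it can be run as is, and the helper names read_rows and diff_rows are made up for illustration:

```python
import csv
import io


def read_rows(handle, delimiter=";"):
    """Return the rows of an open CSV file as a frozenset of tuples."""
    return frozenset(tuple(row) for row in csv.reader(handle, delimiter=delimiter))


def diff_rows(past, now):
    """Return (added, removed) as sorted lists of ';'-joined lines."""
    added = sorted(";".join(row) for row in now - past)    # in "now" only
    removed = sorted(";".join(row) for row in past - now)  # in "past" only
    return added, removed


# Stand-ins for the two files; with real files, pass
# open('past.csv', newline='') and open('now.csv', newline='') instead.
past_text = (
    "China;Beijing;BeiwaiOnline BFSU;;;\n"
    "Italy;Curno;Bergamo, Anderson House;;Yes;\n"
)
now_text = (
    "Italy;Curno;Bergamo, Anderson House;;Yes;\n"
    "This;Is;A;New;Line;;\n"
)

s_past = read_rows(io.StringIO(past_text))
s_now = read_rows(io.StringIO(now_text))

added, removed = diff_rows(s_past, s_now)
print("added:", added)
print("removed:", removed)
```

Because the rows live in sets, line order is ignored, which is what the question asks for. The same property means a row that appears twice in one file is counted once, so this sketch would not report a change in the number of duplicate lines.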

To address your updated question on why you're seeing TypeError: unhashable type: 'list': the csv reader returns each row as a list when iterated. Lists are not hashable and therefore cannot be inserted into a set.

To fix this, you'll need to convert each list into a tuple before adding it to the set. See the previous section of my answer for an example of how this can be done.
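As a small illustration of that point (Python 3 syntax, with two made-up input lines rather than your files), the first attempt below fails in the same way as your script, and the second succeeds because each row is converted to a tuple first:

```python
import csv

lines = ["a;b;c", "d;e;f"]  # illustrative data, not taken from the question

# Passing the reader straight to set() fails: each row is a list.
try:
    set(csv.reader(lines, delimiter=";"))
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Converting each row to a tuple makes it hashable.
rows = set(tuple(row) for row in csv.reader(lines, delimiter=";"))
print(rows)
```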

To address the additional errors you're seeing, they are both due to the content of your CSV files.

_csv.Error: newline inside string

It looks like you have quote characters (") somewhere in your data, which confuses the parser. I'm not familiar enough with the csv module to tell you exactly what has gone wrong, at least not without having a peek at your data.

I did however manage to reproduce the error as such:

>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: newline inside string

In this case, it can be fixed by instructing the reader not to do any special processing of quotes (see csv.QUOTE_NONE). Do note that this disables the handling of quoted data, so a delimiter inside a quoted string will now split the string into separate entries.

>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
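To make the trade-off of csv.QUOTE_NONE concrete, here is a short sketch (Python 3, using an invented line that contains a delimiter inside a quoted field). With the default quoting the field stays in one piece; with QUOTE_NONE it is split at the delimiter and the quote characters are kept as data:

```python
import csv

line = ['a;"b;c";d']  # illustrative input with a quoted delimiter

# Default quoting: the quoted field is kept together, quotes are removed.
default_rows = list(csv.reader(line, delimiter=";"))
print(default_rows)

# QUOTE_NONE: quotes are ordinary characters, so the field is split.
raw_rows = list(csv.reader(line, delimiter=";", quoting=csv.QUOTE_NONE))
print(raw_rows)
```

If your real data relies on quoted fields that contain semicolons, QUOTE_NONE will change the number of columns in those rows, so it is worth checking a few lines by hand before relying on it.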

_csv.Error: line contains NULL byte

I'm guessing this might be down to the encoding of your CSV files.
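One way to test the encoding guess is to look at the raw bytes of the file. The sketch below is an assumption-laden illustration, not a diagnosis of your files: it is Python 3, it assumes the NULL bytes come from a UTF-16 encoded file (a common cause, but only one possibility), and the helper names has_nul_bytes and read_rows are invented for this example. It writes a small UTF-16 file to a temporary directory so that it can be run on its own:

```python
import csv
import os
import tempfile


def has_nul_bytes(path):
    """Return True if the raw bytes of the file contain a NUL byte."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read()


def read_rows(path, encoding, delimiter=";"):
    """Read a CSV file with an explicit encoding and return rows as tuples."""
    with open(path, "r", encoding=encoding, newline="") as handle:
        return [tuple(row) for row in csv.reader(handle, delimiter=delimiter)]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file standing in for a CSV saved as UTF-16.
    demo_path = os.path.join(tmp, "past.csv")
    with open(demo_path, "w", encoding="utf-16", newline="") as handle:
        handle.write("Italy;Bari;Yes\n")

    found_nul = has_nul_bytes(demo_path)              # ASCII text in UTF-16 has NUL bytes
    rows = read_rows(demo_path, encoding="utf-16")    # decoding first avoids the problem

print(found_nul)
print(rows)
```

If has_nul_bytes returns True for one of your files but decoding it as UTF-16 fails or gives nonsense, the file probably has a different problem (for example, it may be binary or truncated), and inspecting the first few bytes is the next step.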
