Carto_

Reputation: 597

Python - Comparing two CSV files produced by the same script to find changes between runs

Background: I use a Python script to scrape data (HTML) from a website and write it to a CSV file.

The CSV file looks like this:

Hong Kong;The Jardine Engineering Corporation Limited
Hong Kong;Towngas
Hong Kong;Tricor Services Limited
Hong Kong;UL International Limitied
Hong Kong;Urban Property Management Limited
Hong Kong;VTECH Corporate Services Ltd.
Vietnam;Cam Ranh Computer Co. Ltd
Vietnam;CFTP Company
Vietnam;Chevron Vietnam

First column : Country

Second column : Name

The file has more than 5000 rows.

I need to compare this CSV file with another one (produced by the same script, so it has the same structure) to track any changes: new lines or removed ones. Ideally, the result would be a file listing all the changes, or they could be printed in the terminal.

*REMEMBER that if something changes in the CSV file (for example, one more row), all the following rows will be shifted.*

Upvotes: 0

Views: 1469

Answers (3)

Carto_

Reputation: 597

from difflib import unified_diff

OLD_PATH = r'/Users/abelrossignol/Desktop/1.csv'
NEW_PATH = r'/Users/abelrossignol/Desktop/2.csv'

# Read both versions as lists of lines (line endings are kept)
with open(OLD_PATH, 'r') as old:
    old_lines = list(old)
with open(NEW_PATH, 'r') as new:
    new_lines = list(new)

# Write the unified diff of the two files to Out.txt
with open("Out.txt", 'w') as out:
    for line in unified_diff(old_lines, new_lines, fromfile=OLD_PATH, tofile=NEW_PATH):
        out.write(line)

print("Written")

This seems to work perfectly. I'm still trying to understand the structure of Out.txt, but the hardest part is done.

Thank you very much for your help ;-)

I hope this will be helpful to someone else one day.
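On the structure of Out.txt: unified_diff output starts with two header lines (`---` for the old file, `+++` for the new one), followed by one or more hunks introduced by a `@@ -start,count +start,count @@` marker. Inside a hunk, a leading `-` means the line exists only in the old file, a leading `+` means it exists only in the new file, and a leading space marks unchanged context. A minimal sketch that keeps only the removed and added rows, assuming that is all you need (`split_changes` is a hypothetical helper name, and the sample rows are borrowed from the question):

```python
from difflib import unified_diff

def split_changes(old_lines, new_lines):
    """Return (removed, added) lines parsed from unified diff output."""
    removed, added = [], []
    # n=0 drops the unchanged context lines around each change
    diff = unified_diff(old_lines, new_lines, n=0)
    for index, line in enumerate(diff):
        # The first two lines are the '---' / '+++' file headers;
        # hunk markers are the only later lines starting with '@@'.
        if index < 2 or line.startswith('@@'):
            continue
        if line.startswith('-'):
            removed.append(line[1:].rstrip('\n'))
        elif line.startswith('+'):
            added.append(line[1:].rstrip('\n'))
    return removed, added

old = ['Hong Kong;Towngas\n', 'Vietnam;CFTP Company\n', 'Vietnam;Chevron Vietnam\n']
new = ['Hong Kong;Towngas\n', 'Hong Kong;Tricor Services Limited\n', 'Vietnam;Chevron Vietnam\n']

removed, added = split_changes(old, new)
print('removed:', removed)
print('added:', added)
```

With real files, `old` and `new` would be the `old_lines` and `new_lines` lists read in the script above.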

Upvotes: 0

Li-aung Yip

Reputation: 12486

Use GNU diff. It is a command-line tool designed to do exactly what you want. GUI versions are available.

From Wikipedia:

In computing, diff is a file comparison utility that outputs the differences between two files. It is typically used to show the changes between one version of a file and a former version of the same file. Diff displays the changes made per line for text files. Modern implementations also support binary files. The output is called a "diff", or a patch, since the output can be applied with the Unix program patch. The output of similar file comparison utilities are also called a "diff"; like the use of the word "grep" for describing the act of searching, the word diff is used in jargon as a verb for calculating any difference.

Giving you the benefit of the doubt, you probably tried to Google something like "Find differences between two csv files from Python". If you set aside the fact that the files are in CSV format, or that they were created using Python, a search for "find differences between text files" would have found GNU diff for you.


Edit:

Adding one line poses no problem for GNU diff. It will find the one line that changed, and tell you about it.

Example:

lws@helios:~$ cat file1
alpha
beta
charlie
delta
echo
foxtrot

lws@helios:~$ cat file2
alpha
beta
charlie
CHAMELEON
delta
echo
foxtrot

lws@helios:~$ diff file1 file2
3a4
> CHAMELEON
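The `3a4` line reads: after line 3 of file1, add what is line 4 of file2. If you would rather stay inside Python, the standard library's difflib.SequenceMatcher reports the same kind of edit. This is only a sketch of the idea, not a replacement for GNU diff; note that its indices are 0-based, while diff counts lines from 1:

```python
from difflib import SequenceMatcher

file1 = ['alpha', 'beta', 'charlie', 'delta', 'echo', 'foxtrot']
file2 = ['alpha', 'beta', 'charlie', 'CHAMELEON', 'delta', 'echo', 'foxtrot']

matcher = SequenceMatcher(None, file1, file2)

# Keep only the operations that describe a difference
changes = [op for op in matcher.get_opcodes() if op[0] != 'equal']

for tag, i1, i2, j1, j2 in changes:
    # file1[i1:i2] is turned into file2[j1:j2]
    print(tag, file1[i1:i2], '->', file2[j1:j2])
```

For these two lists the only non-equal operation is an insert of `CHAMELEON` at index 3; the lines after it are matched normally even though they have shifted by one.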

Upvotes: 1

Burhan Khalid

Reputation: 174622

Welcome to StackOverflow. :)

Your problem boils down to doing a diff between two lists. This is available in Python via difflib.

This example from the manual should help you:

>>> from difflib import ndiff, restore
>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
...              'ore\ntree\nemu\n'.splitlines(keepends=True))
>>> diff = list(diff) # materialize the generated delta into a list
>>> print(''.join(restore(diff, 1)), end="")
one
two
three
>>> print(''.join(restore(diff, 2)), end="")
ore
tree
emu
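The delta that ndiff produces is itself readable: every line starts with a two-character code, `- ` for lines only in the first sequence, `+ ` for lines only in the second, two spaces for lines common to both, and `? ` for hint lines that point at the characters that differ between two similar lines. A small sketch with rows from the question (the corrected spelling of "Limited" in `new` is an assumption made up for the illustration):

```python
from difflib import ndiff

old = ['Hong Kong;Towngas', 'Hong Kong;UL International Limitied']
new = ['Hong Kong;Towngas', 'Hong Kong;UL International Limited']

delta = list(ndiff(old, new))

# Sort the delta lines by their two-character prefix
removed = [line[2:] for line in delta if line.startswith('- ')]
added = [line[2:] for line in delta if line.startswith('+ ')]
hints = [line for line in delta if line.startswith('? ')]

for line in delta:
    print(line.rstrip('\n'))
```

The `? ` lines are not part of either file, so they should be skipped when collecting changes.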

To print the changes (here to standard output; any open file object works the same way):

>>> import sys
>>> from difflib import unified_diff
>>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
>>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
>>> sys.stdout.writelines(unified_diff(s1, s2, fromfile='before.py', tofile='after.py'))
--- before.py
+++ after.py
@@ -1,4 +1,4 @@
-bacon
-eggs
-ham
+python
+eggy
+hamster
 guido
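If you only care about which rows were added or removed, and not where they sit in the file, a plain set difference avoids the shifting problem altogether. This is a different technique from difflib, sketched here under the assumptions that rows are unique and `;`-separated as in the question; `load_rows` is a hypothetical helper, and the io.StringIO objects stand in for your two real files:

```python
import csv
import io

def load_rows(fileobj):
    """Read 'Country;Name' rows into a set of tuples, skipping blank lines."""
    return {tuple(row) for row in csv.reader(fileobj, delimiter=';') if row}

# Stand-ins for open('1.csv', newline='') and open('2.csv', newline='')
old_csv = io.StringIO("Hong Kong;Towngas\nVietnam;CFTP Company\n")
new_csv = io.StringIO("Hong Kong;Towngas\nVietnam;Chevron Vietnam\n")

old_rows = load_rows(old_csv)
new_rows = load_rows(new_csv)

added = sorted(new_rows - old_rows)    # rows only in the new file
removed = sorted(old_rows - new_rows)  # rows only in the old file

print('added:', added)
print('removed:', removed)
```

The trade-off is that duplicate rows and row order are ignored, so use difflib when either of those matters.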

Upvotes: 1
