Reputation: 5
I have two CSV files, both are different but similar. I would like to compare them and output changes and if a variable has been added or removed. I would like to output the changes in either a new CSV or text file.
Below is some code of what I have already tried as well as the two csv files. I would also be willing to use difflib and output that to a text file.
file1.csv:
name1,2.0001
name2,3.4010
name4,4.0000
name5,1.0000
name6,1.0000
name8,1.9001
name10,2.7654
file2.csv:
name1,3.0000
name2,3.4010
name3,1.0000
name5,1.0901
name6,1.0000
name7,3.4445
name11,8.0009
name12,0.1180
Here is code I have tried:
with open('file1.csv', 'r') as file1, open('file2.csv', 'r') as file2:
file1 = file1.readlines()
file2 = file2.readlines()
with open('new_file.csv', 'w') as outFile:
for line in file2:
if line not in file1:
outFile.write(line)
Expected output would be either a csv file or text file that would show things like:
name1 value changed from 2.0001 to 3.0000
name3 value added
name4 value removed
name5 value changed from 1.0000 to 1.0901
name7 value added
name8 value removed
name10 value removed
name11 value added
name12 value added
Upvotes: 0
Views: 1052
Reputation: 13477
You want to compare two tables. The right tool for the job is relational database.
Your code snippets use Python. Python comes with sqlite3 database engine built-in, but I don't see a reason to use python for the simple processing task you request.
Instead, I'd do it in sqlite3
itself, wrapped in a shell script:
#!/bin/bash
# compare-CSVs.bash
sqlite3 <<EOF
.mode csv
.header on
.separator ',' "\n"
-- import data:
.import file1.csv file1
.import file2.csv file2
-- sadly sqlite does not support full joins, so we will augment left join with data missing from file1.csv:
create table data as
select
file1.*
, file2.*
from (
select
name as file1_name
, value as file1_value
from file1
) file1
left join (
select
name as file2_name
, value as file2_value
from file2
) file2
on file2.file2_name == file1.file1_name
union all
select
file1.*
, file2.*
from file2
left join file1
on file1.name == file2.name
where file1.name is null
;
-- output to stdout:
select
file1_name || ' value removed' as "changes:"
from data
where file2_name is null
union all
select
file2_name || ' value added'
from data
where file1_name is null
union all
select
file1_name || ' value changed from ' || file1_value || ' to ' || file2_value
from data
where file1_value != file2_value
;
.exit
EOF
MWE:
cat > file1.csv <<EOF
name,value
name1,2.0001
name2,3.4010
name4,4.0000
name5,1.0000
name6,1.0000
name8,1.9001
name10,2.7654
EOF
cat > file2.csv <<EOF
name,value
name1,3.0000
name2,3.4010
name3,1.0000
name5,1.0901
name6,1.0000
name7,3.4445
name11,8.0009
name12,0.1180
EOF
./compare-CSVs.bash
Output:
changes:
"name4 value removed"
"name8 value removed"
"name10 value removed"
"name3 value added"
"name7 value added"
"name11 value added"
"name12 value added"
"name1 value changed from 2.0001 to 3.0000"
"name5 value changed from 1.0000 to 1.0901"
Upvotes: 0
Reputation: 40733
My solution is to turn each csv into a dictionary with the first column as the keys and the second column as the values. After that, I can loop through the keys and determine if the corresponding values were changed, removed, or added.
import csv
import re
def csv2dict(filename):
with open(filename) as file_handle:
reader = csv.reader(file_handle)
dict_object = dict(reader)
return dict_object
def separate_text_and_number(value):
text, number = re.match(r'(\D+)(\d+)', value).groups()
number = int(number)
return (text, number)
def main():
""" Entry """
csv1 = csv2dict('file1.csv')
csv2 = csv2dict('file2.csv')
all_keys = csv1.keys() | csv2.keys()
for key in sorted(all_keys, key=separate_text_and_number):
if key not in csv2:
print(f'{key} value removed')
elif key not in csv1:
print(f'{key} value added')
elif csv1[key] != csv2[key]:
print(f'{key} value changed from {csv1[key]} to {csv2[key]}')
if __name__ == '__main__':
main()
name1 value changed from 2.0001 to 3.0000
name3 value added
name4 value removed
name5 value changed from 1.0000 to 1.0901
name7 value added
name8 value removed
name10 value removed
name11 value added
name12 value added
csv2dict
opens a file and converts the contents into a dictionaryseparate_text_and_number
splits name14
into ('name', 14)
to help with sorting the keysdict.keys()
method returns a set-like object which contains all the keys. I use the |
operator to find a union of two sets of keys.separate_text_and_number
Upvotes: 1