Reputation: 767

Finding differences between 2 csv files using unix command or shell scripting

I am a newbie to Unix and shell scripting. I am trying to find differences between 2 .csv files using Unix command. There are conditions on the basis of which I have to find the difference.

Update to an entry i.e. any row in file1 (unique id is employee id) exists in file2 but different value for another column. It is considered as an update. In such case, I need that entry from file2
If an entry exists in file2 and not in the file, It is considered as the addition of new employee. I need that row from file2.
If an entry exists in file1 and not in the file2, It is considered as the deletion of an employee. I need that row from file1.

I am able to find the update and new records using comm -23 sorted_file_2.csv sorted_file_1.csv > updates.csv but not able to find an entry that is deleted.

I have checked solutions using below commands

grep -v -x -f sorted_file_2.csv sorted_file_1.csv > deleted.csv

awk 'NR==FNR{a[$0]=1;next}!a[$0]' sorted_file_2.csv sorted_file_1.csv > deleted.csv

diff sorted_file_1.csv sorted_file_2.csv > deleted.csv

The above commands are always giving me entries that are updated as well as deleted. What I am looking for the only entry from file1 which is not in file2

P.S. The two file can contain all the 3 cases mentioned above. I need output in two csv files. One for update/new records and another one for deleted records.

File1.csv

Row	Employee_ID	Salary	Designation
1	John	2000	Clerk
2	Smith	3000	Supervisor
3	Jenny	1000	Intern
4	Vicky	5000	Manager

File2.csv

Row	Employee_ID	Salary	Designation
1	John	2000	Clerk
2	Smith	4000	Senior Supervisor
4	Vicky	5000	Manager
5	James	5000	Auditor

In the above 2 files, row #2 in file2 is an update, row #5 is a new entry. Both of them can be combined in single file as

Update_new.csv

Row	Employee_ID	Salary	Designation
2	Smith	4000	Senior Supervisor
5	James	5000	Auditor

Deleted entry is row#3 in file1.csv which is not present in file2.csv to be kept in separate file

deleted.csv

Row	Employee_ID	Salary	Designation
3	Jenny	1000	Intern

It is fine even if I am able to add all the two files in single file with one extra column specifying "UPDATED","NEW","DELETED" value.

Upvotes: 0

Answers (2)

James Brown

Reputation: 37394

Using awk:

$ awk -F\| '
NR==FNR {                       # store first file to a hash
    a[$1]=$0                    # $1 is the key
    next
}
($1 in a) {                     # process second file. If key in a
    if(a[$1]!=$0)               # but data has changed
        print > "updated"       # output to updated file
    delete a[$1]                # delete entry from hash
    next                        # skip to next record in file
}
{
    print > "updated"           # if $1 not in a, it is new, output
}
END {                           # after file 2
    for(i in a)                 # print all leftovers from file 1
        print a[i] > "deleted"  # output to deleted
}' 1.csv 2.csv

Results:

$ cat deleted
3|Jenny|1000|Intern
$ cat updated
2|Smith|4000|Senior Supervisor
5|James|5000|Auditor

Upvotes: 0

udiboy1209

Reputation: 1502

You can easily do this with diff itself.

Test files I used

1.csv:

a,1
b,2
d,3
f,5

2.csv:

b,2
c,6
d,3
f,4

This is output of diff

$ diff 1.csv 2.csv
1d0
< a,1
2a2
> c,6
4c4
< f,5
---
> f,4

You can search for the letters 'a', 'd', 'c' to get added, deleted and changed lines. Here is an example of how to get added lines

$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a'
2a2
> c,6

You can properly extract only csv using another sed command

$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a' | sed -n '/^[><]/s/[><] \(.*\)$/\1/p'
c,6

You can easily change the grep and sed commands to format in whatever way you want.

Upvotes: 1

Finding differences between 2 csv files using unix command or shell scripting

Answers (2)

Related Questions