Reputation: 767
I am a newbie to Unix and shell scripting. I am trying to find differences between 2 .csv files using Unix command. There are conditions on the basis of which I have to find the difference.
I am able to find the update and new records using comm -23 sorted_file_2.csv sorted_file_1.csv > updates.csv
but not able to find an entry that is deleted.
I have checked solutions using below commands
grep -v -x -f sorted_file_2.csv sorted_file_1.csv > deleted.csv
awk 'NR==FNR{a[$0]=1;next}!a[$0]' sorted_file_2.csv sorted_file_1.csv > deleted.csv
diff sorted_file_1.csv sorted_file_2.csv > deleted.csv
The above commands are always giving me entries that are updated as well as deleted. What I am looking for the only entry from file1 which is not in file2
P.S. The two file can contain all the 3 cases mentioned above. I need output in two csv files. One for update/new records and another one for deleted records.
File1.csv
Row | Employee_ID | Salary | Designation |
---|---|---|---|
1 | John | 2000 | Clerk |
2 | Smith | 3000 | Supervisor |
3 | Jenny | 1000 | Intern |
4 | Vicky | 5000 | Manager |
File2.csv
Row | Employee_ID | Salary | Designation |
---|---|---|---|
1 | John | 2000 | Clerk |
2 | Smith | 4000 | Senior Supervisor |
4 | Vicky | 5000 | Manager |
5 | James | 5000 | Auditor |
In the above 2 files, row #2 in file2 is an update, row #5 is a new entry. Both of them can be combined in single file as
Update_new.csv
Row | Employee_ID | Salary | Designation |
---|---|---|---|
2 | Smith | 4000 | Senior Supervisor |
5 | James | 5000 | Auditor |
Deleted entry is row#3 in file1.csv which is not present in file2.csv to be kept in separate file
deleted.csv
Row | Employee_ID | Salary | Designation |
---|---|---|---|
3 | Jenny | 1000 | Intern |
It is fine even if I am able to add all the two files in single file with one extra column specifying "UPDATED","NEW","DELETED" value.
Upvotes: 0
Views: 1168
Reputation: 37394
Using awk:
$ awk -F\| '
NR==FNR { # store first file to a hash
a[$1]=$0 # $1 is the key
next
}
($1 in a) { # process second file. If key in a
if(a[$1]!=$0) # but data has changed
print > "updated" # output to updated file
delete a[$1] # delete entry from hash
next # skip to next record in file
}
{
print > "updated" # if $1 not in a, it is new, output
}
END { # after file 2
for(i in a) # print all leftovers from file 1
print a[i] > "deleted" # output to deleted
}' 1.csv 2.csv
Results:
$ cat deleted
3|Jenny|1000|Intern
$ cat updated
2|Smith|4000|Senior Supervisor
5|James|5000|Auditor
Upvotes: 0
Reputation: 1502
You can easily do this with diff itself.
Test files I used
1.csv:
a,1
b,2
d,3
f,5
2.csv:
b,2
c,6
d,3
f,4
This is output of diff
$ diff 1.csv 2.csv
1d0
< a,1
2a2
> c,6
4c4
< f,5
---
> f,4
You can search for the letters 'a', 'd', 'c' to get added, deleted and changed lines. Here is an example of how to get added lines
$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a'
2a2
> c,6
You can properly extract only csv using another sed command
$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a' | sed -n '/^[><]/s/[><] \(.*\)$/\1/p'
c,6
You can easily change the grep and sed commands to format in whatever way you want.
Upvotes: 1