Reputation: 119
I would like to compare two files based on first field $1
of each file.
Then populate Match Lines from Both the files - ( Available in Aug.csv and Sep.csv) and print last field Remarks as "Matched"
Non- Match Lines from Aug.csv - ( Available in Aug.csv Not-Available in Sep.csv) and print not found (i.e "NOT" ) like 5 times equivalent to No of fields ($NF) in the Sep.csv file "NOT,NOT,NOT,NOT,NOT" and print last field Remarks as "Not in Sep.csv" or FILENAME
Non- Match Lines from Sep.csv - ( Available in Sep.csv Not-Available in Aug.csv) and print not found like (i.e "NOT") 4 times equivalent to No of fields ($NF) in the Aug.csv file "NOT,NOT,NOT,NOT" and print last filed Remarks as "Not in Aug.csv" or FILENAME
Aug.csv
Name,Age,Place,Des
aaa,40,xxx,Aug
aaa,20,yyy,Aug
ccc,35,xxx,Aug
Sep.csv
Name,Age,Place,Edu,Des
aaa,50,zzz,eee,Sep
bbb,30,xxx,yyy,Sep
aaa,60,yyy,fff,Sep
bbb,50,yyy,fff,Sep
Expected Output.csv
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
I have tried below two commands to get the desired output but un-successful
First Command:
awk -v first="NOT,NOT,NOT,NOT" -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next}{if (a[$1])print a[$1],$0,"Matched";else print first, $0,"Not in Aug.csv";}' OFS="," Aug.csv Sep.csv >Output.csv
Second Command:
awk -v first="NOT,NOT,NOT,NOT" -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next} !($1 in a) {print $0,second,"Not in Sep.csv";}' OFS="," Sep.csv Aug.csv >>Output.csv
got the below Output.csv from above commands
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
Here, I am missing the below two matching lines (Aug.csv) from Expected Output. Please advise how to handle this ... It seems that it's ignoring the duplicated entries
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
Would like to know how this can be a dynamic variable "$first"
and " $second"
( i.e awk -v first="NOT,NOT,NOT,NOT" -v second="NOT,NOT,NOT,NOT,NOT"
) based on No of Fields/Headers available in the Aug.csv & Sep.csv
because in Original files contains more No of fields and used to be vary like 10 fields, 15 fields etc every time ... Don't want to type 10 times "NOT" manually
Or is there any way to REPEAT
function at the time of print "FS"
based on No of Fields in the original file.
so that my output will be formatted below
Expected Output.csv
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
,,,,bbb,30,xxx,yyy,Sep,Not in Aug.csv
,,,,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,,,,,,Not in Sep.csv
Kindly advise , Looking for your suggesstions ...
Upvotes: 2
Views: 85
Reputation: 92854
Complex GNU awk solution:
compare.awk script:
function prNot(n) {
r=s="NOT"; while(--n) r=r FS s;
return r
}
BEGIN{ FS=OFS="," }
NR==FNR{
if (NR==1) {
sep_nf=NF; sep_fn=FILENAME; h=$0
} else {
sep[$1][++c]=$2;
for(i=3;i<=NF;i++){ sep[$1][c]=sep[$1][c] FS $i }
}
next
}
FNR==1{
aug_nf=NF; aug_fn=FILENAME; print $0,h,"Remarks"; next
}
$1 in sep{ matched[$1]; for(i in sep[$1]) print $0,$1,sep[$1][i],"Matched" }
!($1 in sep){ print $0,prNot(sep_nf),"Not in "sep_fn }
END{
for(i in sep)
if (!(i in matched)) {
for(j in sep[i]) print prNot(aug_nf),i,sep[i][j],"Not in "aug_fn
}
}
Usage:
awk -f compare.awk Sep.csv Aug.csv
The output:
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
Upvotes: 2
Reputation: 203219
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
for (i=1; i<=NF; i++) {
nots[ARGIND] = (i>1 ? nots[ARGIND] OFS : "") "NOT"
}
}
NR==FNR {
file1[$1][++cnt[$1]] = $0
next
}
{
file2[$1]
if ($1 in file1) {
for (num in file1[$1]) {
print file1[$1][num], $0, (FNR>1 ? "Matched" : "Remarks")
}
}
else {
print nots[1], $0, "Not in " ARGV[1]
}
}
END {
for (name in file1) {
if ( !(name in file2) ) {
for (num in file1[name]) {
print file1[name][num], nots[2], "Not in " ARGV[2]
}
}
}
}
.
$ awk -f tst.awk Aug.csv Sep.csv
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
If the output order matters then there's various ways to handle it...
Upvotes: 2