SVR
SVR

Reputation: 119

awk to compare two files and print formatted output

I would like to compare two files based on first field $1 of each file.

Then populate Match Lines from Both the files - ( Available in Aug.csv and Sep.csv) and print last field Remarks as "Matched"

Non- Match Lines from Aug.csv - ( Available in Aug.csv Not-Available in Sep.csv) and print not found (i.e "NOT" ) like 5 times equivalent to No of fields ($NF) in the Sep.csv file "NOT,NOT,NOT,NOT,NOT" and print last field Remarks as "Not in Sep.csv" or FILENAME

Non- Match Lines from Sep.csv - ( Available in Sep.csv Not-Available in Aug.csv) and print not found like (i.e "NOT") 4 times equivalent to No of fields ($NF) in the Aug.csv file "NOT,NOT,NOT,NOT" and print last filed Remarks as "Not in Aug.csv" or FILENAME

Aug.csv

Name,Age,Place,Des
aaa,40,xxx,Aug
aaa,20,yyy,Aug
ccc,35,xxx,Aug

Sep.csv

Name,Age,Place,Edu,Des
aaa,50,zzz,eee,Sep
bbb,30,xxx,yyy,Sep
aaa,60,yyy,fff,Sep
bbb,50,yyy,fff,Sep

Expected Output.csv

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

I have tried below two commands to get the desired output but un-successful

First Command:

 awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next}{if (a[$1])print a[$1],$0,"Matched";else print first, $0,"Not in Aug.csv";}' OFS="," Aug.csv Sep.csv >Output.csv

Second Command:

awk -v first="NOT,NOT,NOT,NOT"  -v second="NOT,NOT,NOT,NOT,NOT" -F"," 'NR==FNR{a[$1]=$0;next} !($1 in a) {print $0,second,"Not in Sep.csv";}' OFS="," Sep.csv Aug.csv  >>Output.csv  

got the below Output.csv from above commands

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

Here, I am missing the below two matching lines (Aug.csv) from Expected Output. Please advise how to handle this ... It seems that it's ignoring the duplicated entries

aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched

Would like to know how this can be a dynamic variable "$first" and " $second" ( i.e awk -v first="NOT,NOT,NOT,NOT" -v second="NOT,NOT,NOT,NOT,NOT" ) based on No of Fields/Headers available in the Aug.csv & Sep.csv because in Original files contains more No of fields and used to be vary like 10 fields, 15 fields etc every time ... Don't want to type 10 times "NOT" manually Or is there any way to REPEAT function at the time of print "FS" based on No of Fields in the original file. so that my output will be formatted below

Expected Output.csv

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
,,,,bbb,30,xxx,yyy,Sep,Not in Aug.csv
,,,,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,,,,,,Not in Sep.csv

Kindly advise , Looking for your suggesstions ...

Upvotes: 2

Views: 85

Answers (2)

RomanPerekhrest
RomanPerekhrest

Reputation: 92854

Complex GNU awk solution:

compare.awk script:

function prNot(n) { 
    r=s="NOT"; while(--n) r=r FS s; 
    return r 
}
BEGIN{ FS=OFS="," }
NR==FNR{ 
    if (NR==1) { 
        sep_nf=NF; sep_fn=FILENAME; h=$0 
    } else { 
        sep[$1][++c]=$2; 
        for(i=3;i<=NF;i++){ sep[$1][c]=sep[$1][c] FS $i } 
    }
    next 
}
FNR==1{ 
    aug_nf=NF; aug_fn=FILENAME; print $0,h,"Remarks"; next 
}
$1 in sep{ matched[$1]; for(i in sep[$1]) print $0,$1,sep[$1][i],"Matched" }
!($1 in sep){ print $0,prNot(sep_nf),"Not in "sep_fn }
END{ 
    for(i in sep) 
        if (!(i in matched)) { 
            for(j in sep[i]) print prNot(aug_nf),i,sep[i][j],"Not in "aug_fn 
        }  
}

Usage:

awk -f compare.awk Sep.csv Aug.csv

The output:

Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv

Upvotes: 2

Ed Morton
Ed Morton

Reputation: 203219

With GNU awk for true multi-dimensional arrays:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    for (i=1; i<=NF; i++) {
        nots[ARGIND] = (i>1 ? nots[ARGIND] OFS : "") "NOT"
    }
}
NR==FNR {
    file1[$1][++cnt[$1]] = $0
    next
}
{
    file2[$1]
    if ($1 in file1) {
        for (num in file1[$1]) {
            print file1[$1][num], $0, (FNR>1 ? "Matched" : "Remarks")
        }
    }
    else {
        print nots[1], $0, "Not in " ARGV[1]
    }
}
END {
    for (name in file1) {
        if ( !(name in file2) ) {
            for (num in file1[name]) {
                print file1[name][num], nots[2], "Not in " ARGV[2]
            }
        }
    }
}

.

$ awk -f tst.awk Aug.csv Sep.csv
Name,Age,Place,Des,Name,Age,Place,Edu,Des,Remarks
aaa,40,xxx,Aug,aaa,50,zzz,eee,Sep,Matched
aaa,20,yyy,Aug,aaa,50,zzz,eee,Sep,Matched
NOT,NOT,NOT,NOT,bbb,30,xxx,yyy,Sep,Not in Aug.csv
aaa,40,xxx,Aug,aaa,60,yyy,fff,Sep,Matched
aaa,20,yyy,Aug,aaa,60,yyy,fff,Sep,Matched
NOT,NOT,NOT,NOT,bbb,50,yyy,fff,Sep,Not in Aug.csv
ccc,35,xxx,Aug,NOT,NOT,NOT,NOT,NOT,Not in Sep.csv

If the output order matters then there's various ways to handle it...

Upvotes: 2

Related Questions