Merge 2 files with awk while filling in missing values with 'NA'

Question

File 1:

chr1:763668:T:C, C, T, 0.002, 0.001, 0, 0.001, 0.002, 0.002, 0.003, 0, 0.002, 0.690
chr1:775340:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0
chr1:781598:C:T, T, C, 0, 0, 0, 0.001, 0.001, 0.001, 0.002, 0, 0, 0.688
chr1:781706:A:C, C, A, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0
chr1:782519:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.001

File 2:

chr1:763668:T:C, C, T, 0.004, 0.001, 0, 0.002, 0, 0.001, 0.003, 0, 0.001, 0
chr1:775340:A:G, G, A, 0, 0, 0.687, 0, 0.02, 0, 0, 0, 0, 0
chr1:781706:A:C, C, A, 0.771, 0, 0, 0, 0, 0, 0, 0, 0.003, 0.001
chr1:782519:A:G, G, A, 0, 0, 0, 0.802, 0, 0, 0, 0.002, 0, 0.002
chr1:787139:A:G, G, A, 0.001, 0.001, 0, 0.001, 0.003, 0.006, 0.002, 0, 0.003, 0.011

In merged file, for each column 1 identifier (in file 1 and/or file 2), data present in one file but not the other needs to be represented with NA's. The file needs to be sorted on the numerical value in column 1 (eg chr1:763668:T:C)

Desired output file:

chr1:763668:T:C, C, T, 0.002, 0.001, 0, 0.001, 0.002, 0.002, 0.003, 0, 0.002, 0.690, 0.004, 0.001, 0, 0.002, 0, 0.001, 0.003, 0, 0.001, 0
chr1:775340:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.687, 0, 0.02, 0, 0, 0, 0, 0
chr1:781598:C:T, T, C, 0, 0, 0, 0.001, 0.001, 0.001, 0.002, 0, 0, 0.688, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr1:781706:A:C, C, A, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0
chr1:782519:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.802, 0, 0, 0, 0.002, 0, 0.002
chr1:787139:A:G, G, A, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.001, 0.001, 0, 0.001, 0.003, 0.006, 0.002, 0, 0.003, 0.011

Code based on previous assistance with a similar request:

awk 'BEGIN{filler="NA, NA, NA, NA, NA, NA, NA, NA, NA, NA"}
 {
                c1[$1]=$1
                c2[$1]=$2
                rest=substr($0, index($0, $3))

                if(NR==FNR){
                crest[$2]=rest
        }
        else{
                if(crest[$2]==""){
                        crest[$2]=filler OFS rest}
                else{
                        crest[$2]=crest[$2] OFS rest
                }
        }
        }
        END{
                for(i in c2){
                        print c1[i],c2[i],crest[i],filler
        }
}' file1 file2 | awk 'NF=26' | sort -k 1

anubhava · Accepted Answer

You may use this awk:

cat merge.awk

BEGIN {
   filler = "NA, NA, NA, NA, NA, NA, NA, NA, NA, NA"
   FS = OFS = ", "
}
{
   k = $1 OFS $2 OFS $3
}
FNR == NR {
   sub(/^([^,]+, ){3}/, "")
   map[k] = $0
   next
}
{
   print $0, (k in map ? map[k] : filler)
   delete map[k]
}
END {
   for (i in map)
      print i, filler, map[i]
}

Then use it as:

awk -f merge.awk file2 file1

chr1:763668:T:C, C, T, 0.002, 0.001, 0, 0.001, 0.002, 0.002, 0.003, 0, 0.002, 0.690, 0.004, 0.001, 0, 0.002, 0, 0.001, 0.003, 0, 0.001, 0
chr1:775340:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.687, 0, 0.02, 0, 0, 0, 0, 0
chr1:781598:C:T, T, C, 0, 0, 0, 0.001, 0.001, 0.001, 0.002, 0, 0, 0.688, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr1:781706:A:C, C, A, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0, 0.771, 0, 0, 0, 0, 0, 0, 0, 0.003, 0.001
chr1:782519:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.802, 0, 0, 0, 0.002, 0, 0.002
chr1:787139:A:G, G, A, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.001, 0.001, 0, 0.001, 0.003, 0.006, 0.002, 0, 0.003, 0.011

Merge 2 files with awk while filling in missing values with 'NA'

Answers (1)

Related Questions

Merge 2 files with awk while filling in missing values with &#39;NA&#39;

Answers (1)

Related Questions

Merge 2 files with awk while filling in missing values with 'NA'