Reputation: 169
File 1:
chr1:763668:T:C, C, T, 0.002, 0.001, 0, 0.001, 0.002, 0.002, 0.003, 0, 0.002, 0.690
chr1:775340:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0
chr1:781598:C:T, T, C, 0, 0, 0, 0.001, 0.001, 0.001, 0.002, 0, 0, 0.688
chr1:781706:A:C, C, A, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0
chr1:782519:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.001
File 2:
chr1:763668:T:C, C, T, 0.004, 0.001, 0, 0.002, 0, 0.001, 0.003, 0, 0.001, 0
chr1:775340:A:G, G, A, 0, 0, 0.687, 0, 0.02, 0, 0, 0, 0, 0
chr1:781706:A:C, C, A, 0.771, 0, 0, 0, 0, 0, 0, 0, 0.003, 0.001
chr1:782519:A:G, G, A, 0, 0, 0, 0.802, 0, 0, 0, 0.002, 0, 0.002
chr1:787139:A:G, G, A, 0.001, 0.001, 0, 0.001, 0.003, 0.006, 0.002, 0, 0.003, 0.011
In merged file, for each column 1 identifier (in file 1 and/or file 2), data present in one file but not the other needs to be represented with NA's. The file needs to be sorted on the numerical value in column 1 (eg chr1:763668:T:C)
Desired output file:
chr1:763668:T:C, C, T, 0.002, 0.001, 0, 0.001, 0.002, 0.002, 0.003, 0, 0.002, 0.690, 0.004, 0.001, 0, 0.002, 0, 0.001, 0.003, 0, 0.001, 0
chr1:775340:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.687, 0, 0.02, 0, 0, 0, 0, 0
chr1:781598:C:T, T, C, 0, 0, 0, 0.001, 0.001, 0.001, 0.002, 0, 0, 0.688, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr1:781706:A:C, C, A, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0
chr1:782519:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.802, 0, 0, 0, 0.002, 0, 0.002
chr1:787139:A:G, G, A, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.001, 0.001, 0, 0.001, 0.003, 0.006, 0.002, 0, 0.003, 0.011
Code based on previous assistance with a similar request:
awk 'BEGIN{filler="NA, NA, NA, NA, NA, NA, NA, NA, NA, NA"}
{
c1[$1]=$1
c2[$1]=$2
rest=substr($0, index($0, $3))
if(NR==FNR){
crest[$2]=rest
}
else{
if(crest[$2]==""){
crest[$2]=filler OFS rest}
else{
crest[$2]=crest[$2] OFS rest
}
}
}
END{
for(i in c2){
print c1[i],c2[i],crest[i],filler
}
}' file1 file2 | awk 'NF=26' | sort -k 1
Upvotes: 1
Views: 102
Reputation: 785731
You may use this awk
:
cat merge.awk
BEGIN {
filler = "NA, NA, NA, NA, NA, NA, NA, NA, NA, NA"
FS = OFS = ", "
}
{
k = $1 OFS $2 OFS $3
}
FNR == NR {
sub(/^([^,]+, ){3}/, "")
map[k] = $0
next
}
{
print $0, (k in map ? map[k] : filler)
delete map[k]
}
END {
for (i in map)
print i, filler, map[i]
}
Then use it as:
awk -f merge.awk file2 file1
chr1:763668:T:C, C, T, 0.002, 0.001, 0, 0.001, 0.002, 0.002, 0.003, 0, 0.002, 0.690, 0.004, 0.001, 0, 0.002, 0, 0.001, 0.003, 0, 0.001, 0
chr1:775340:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.687, 0, 0.02, 0, 0, 0, 0, 0
chr1:781598:C:T, T, C, 0, 0, 0, 0.001, 0.001, 0.001, 0.002, 0, 0, 0.688, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
chr1:781706:A:C, C, A, 0, 0, 0.687, 0, 0, 0.003, 0, 0, 0.002, 0, 0.771, 0, 0, 0, 0, 0, 0, 0, 0.003, 0.001
chr1:782519:A:G, G, A, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.001, 0, 0, 0, 0.802, 0, 0, 0, 0.002, 0, 0.002
chr1:787139:A:G, G, A, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.001, 0.001, 0, 0.001, 0.003, 0.006, 0.002, 0, 0.003, 0.011
Upvotes: 1