Reputation: 1645
I have two files like shown below which are tab-delimited:
file A
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
....
file B
chr1 123 aa c d e ff
chr1 345 aa e f g gg
chr1 123 aa c d e hh
chr1 567 aa z c a ii
chr1 345 bb x q r kk
chr1 789 df f g s ff
chr1 345 sh d t g ll
...
I want to add a new column to file A from file B based on 2 key columns "chr1", "123" i.e.(first two columns are key columns). If the key columns matches in both files, the data in column 7 in file B should be added to column 3 in file A.
For example (chr1 123) key is found twice in file B, therefore 3rd column in file A has ff and hh separated by comma. If the key is not found it should put NA and output should look like as shown below: output:
chr1 123 ff,hh aa b c d
chr1 234 NA a b c d
chr1 345 gg,kk,ll aa b c d
chr1 456 NA a b c d
I achieved this using the awk solution
awk -F'\t' -v OFS='\t' 'NR==FNR{a[$1FS$2]=a[$1FS$2]?a[$1FS$2]","$7:$7;next}{$3=(($1FS$2 in a)?a[$1FS$2]:"NA")FS $3}1' fileB fileA
Now, i would like to add another column 6 along with column 7. Could anyone suggest how to do this? The output looks like:
chr1 123 ff,hh e,e aa b c d
chr1 234 NA NA a b c d
chr1 345 gg,kk,ll g,r,g aa b c d
chr1 456 NA NA a b c d
Thanks
Upvotes: 1
Views: 286
Reputation: 2504
My suggestion is to use another array to track the next variable you want to add, but to keep the code a little more readable, I've made an executable awk script to generalize it a bit:
#!/usr/bin/awk -f
BEGIN { FS="\t"; OFS="\t" }
{ key = $1 FS $2 }
FNR==NR {
updateArray( a, $7 )
updateArray( b, $6 )
next
}
{ $3 = concat( a, concat( b, $3 ) ) }
1
function updateArray( arr, fld ) {
arr[key] = arr[key]!="" ? arr[key] "," fld : fld
}
function concat( arr, suffix ) {
return( (arr[key]=="" ? "NA" : arr[key]) OFS suffix )
}
Here's the breakdown:
FS
and OFS
valueskey
for every line reada
and b
where they are passed by reference to the function updateArray
and the field value is passed by value$3
using the local concat
function1
As another option, you could make the value stored in a single a[key]
equal to all the file B
fields you want represented in $3
and have them separated by OFS
. That would require parsing and reassembling the value in a[key]
every time it changed as file B
is parsed, but would make creating the $3
a simple three part concatenation.
Upvotes: 0