justaguy

Reputation: 3022

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or -, then $8 of out.txt is used as a key to look up in $2 of file2. When a match is found (there will always be one), the $3 value of that file2 line is appended to $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).

out.txt tab-delimited

R_Index Chr Start   End Ref Alt Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene  Inheritence ExonicFunc.IDP.refGene  AAChange.IDP.refGene
1   chr1    948846  948846  -   A   upstream    ISG15   -0     .    .   .
2   chr1    948870  948870  C   G   UTR5    ISG15   NM_005101.3:c.-84C>G    .   .
4   chr1    949925  949925  C   T   downstream  ISG15   +6  .   .   .
5   chr1    207646923   207646923   G   A   intronic    CR2 >50 .   .   .
8   chr1    948840  948840  -   C   upstream    ISG15   -6  .   .   .

file2 space-delimited

2 ISG15 NM_005101.3 948846-948956 949363-949919

desired output tab-delimited

R_Index Chr Start   End Ref Alt Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene  Inheritence ExonicFunc.IDP.refGene  AAChange.IDP.refGene
1   chr1    948846  948846  -   A   upstream    ISG15   -0:NM_005101.3  .   .   .
2   chr1    948870  948870  C   G   UTR5    ISG15   NM_005101.3:c.-84C>G    .   .
4   chr1    949925  949925  C   T   downstream  ISG15   +6:NM_005101.3  .   .   .
5   chr1    207646923   207646923   G   A   intronic    CR2 >50 .   .   .
8   chr1    948840  948840  -   C   upstream    ISG15   -6:NM_005101.3  .   .   .

Description

lines 1, 3 and 5: `$9` is updated with `:` and the value of `$3` in `file2`
lines 2 and 4 are skipped as their `$9` does not start with a `+` or `-`

awk

awk -v extra=50 -v OFS='\t' '
NR == FNR {
    count[$2] = $1
    for(i = 1; i <= $1; i++) {
        low[$2, i] = $(2 + 2 * i)
        high[$2, i] = $(3 + 2 * i)
        mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
    }
    next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
    for(i = 1; i <= count[$8]; i++)
        if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
            if($4 > mid[$8, i]) {
                sign = "+"
                value = high[$8, i]
            }
            else {
                sign = "-"
                value = low[$8, i]
            }
            diff = (value > $4) ? value - $4 : $4 - value
            $9 = (diff > 50) ? ">50" : (sign diff)
            break
        }
    if(i > count[$8]) {
        $9 = ">50"
    }
}
1
   ' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('

Upvotes: 0

Views: 68

Answers (1)

James K. Lowden

Reputation: 7837

As far as I can tell, your awk code is OK and your bash usage is wrong.

FS='[- ]' file2 FS='\t' file1 |
  awk if($6 == "-" || $6 == "+")
      printf ":" ;
  'FNR==NR {a[$2]=$3; next}
   a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('

I don't know what that's supposed to do. This much is certain, though: on the second line, the awk code needs to be quoted (awk 'if(....). The bash error message arises because bash is interpreting the unquoted awk code itself, and ( is not a valid shell-script token after if.
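For what it's worth, here is a minimal sketch of what the second stage might look like once the whole awk program sits inside one pair of single quotes. It assumes the goal is the "desired output" shown in the question, and that only a `$9` that starts with `+` or `-` should be extended (so `NM_005101.3:c.-84C>G` is left alone). Note that awk only accepts an `if` statement inside a `{ ... }` action, so quoting the original fragment would not be enough on its own; the sketch uses a pattern instead. The names `tmp`, `tx` and `result` are mine, and the sample rows are copied from the question purely so the script runs by itself.

```shell
#!/bin/sh
# Sketch only: sample data copied from the question; tmp, tx and result are
# illustrative names, not part of the original script.
tmp=$(mktemp -d)

# file2 is space-delimited: $2 = gene, $3 = transcript ID
printf '2 ISG15 NM_005101.3 948846-948956 949363-949919\n' > "$tmp/file2"

# out.txt is tab-delimited (header plus three of the sample rows)
{
  printf 'R_Index\tChr\tStart\tEnd\tRef\tAlt\tFunc.IDP.refGene\tGene.IDP.refGene\tGeneDetail.IDP.refGene\tInheritence\tExonicFunc.IDP.refGene\tAAChange.IDP.refGene\n'
  printf '1\tchr1\t948846\t948846\t-\tA\tupstream\tISG15\t-0\t.\t.\t.\n'
  printf '2\tchr1\t948870\t948870\tC\tG\tUTR5\tISG15\tNM_005101.3:c.-84C>G\t.\t.\n'
  printf '5\tchr1\t207646923\t207646923\tG\tA\tintronic\tCR2\t>50\t.\t.\t.\n'
} > "$tmp/out.txt"

result=$(awk -v OFS='\t' '
    # First file (file2, default whitespace FS): map gene -> transcript ID
    NR == FNR { tx[$2] = $3; next }
    # Second file (tab FS): append ":ID" when $9 starts with + or -
    $9 ~ /^[+-]/ && ($8 in tx) { $9 = $9 ":" tx[$8] }
    1
' "$tmp/file2" FS='\t' "$tmp/out.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

In the real pipeline the first awk's output would be piped in by naming `-` as the second input (`... | awk -v OFS='\t' '...' file2 FS='\t' - > final.txt`); the temporary files here only stand in for that.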

Upvotes: 1
