justaguy

Reputation: 3022

awk to parse input with multiple conditions

This post is a continuation of:

using awk to parse specific condition

I apologize if I should have added this to that thread instead. I have tried to modify the awk script below, but with no luck:

awk 'NR==2 {
  split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));
  print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' \
OFS="\t" ${id}_position.txt > ${id}_parse.txt

A user can input several possible conditions, each resulting in different output. One of those conditions is shown in the data sample below, with the field in bold being the one that needs to be parsed:

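Parse rules:
1. The digits after NC_ and before the . with the leading zeros dropped (there are 4 zeros here, but that is not always the case)
2. The number after g. (before the underscore) and the number after the underscore
3. The letters after del (TG here)
4. A hyphen (-) is always used in the last spot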

Data Sample

 Input Variant  Errors  Chromosomal Variant Coding Variant(s)
 NM_004004.5:c.575_576delCA     **NC_000013.10:g.20763145_20763146delTG** NM_004004.5:c.575_576delCA    XM_005266354.1:c.575_576delCA XM_005266355.1:c.575_576delCA XM_005266356.1:c.575_576delCA

Desired Output

13     20763145     20763146     TG     -

Thank you :).

Upvotes: 1

Views: 748

Answers (1)

Kaz

Reputation: 58578

TXR Language:

Input Variant@(skip)
@(skip)NC_@{nc-raw}.@(skip)g.@{g-left}_@{g-right}del@{letters 2}@(skip)
@(bind nc-num @(int-str nc-raw))
@(output)
@{nc-num 6} @{g-left 12} @{g-right 12} @{letters 6} -
@(end)

Run:

$ txr nc.txr data
13     20763145     20763146     TG     -

All in the command line:

$ txr -c 'Input Variant@(skip)
@(skip)NC_@{nc-raw}.@(skip)g.@{g-left}_@{g-right}del@{letters 2}@(skip)
@(bind nc-num @(int-str nc-raw))
@(output)
@{nc-num 6} @{g-left 12} @{g-right 12} @{letters 6} -
@(end)' data
13     20763145     20763146     TG     -
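
Since the question asks about awk, the same four parse rules can also be sketched in awk. This is only a minimal sketch under some assumptions: the function name parse_variant is hypothetical, the header occupies the first line, the chromosomal variant is a whitespace-separated field of the form NC_digits.digits:g.start_enddelLETTERS, and the bold markers (**) from the sample are not present in the real file. It handles only this deletion condition; other input conditions would need their own patterns. Output is tab-separated, as in the original script.

```shell
# parse_variant (hypothetical name) reads the report on stdin and prints:
# chromosome, start, end, deleted bases, and a literal hyphen.
parse_variant() {
  awk -v OFS='\t' '
    NR > 1 {                       # assumption: line 1 is the header
      for (i = 1; i <= NF; i++) {
        # assumption: only the deletion form is handled here
        if ($i ~ /^NC_[0-9]+\.[0-9]+:g\.[0-9]+_[0-9]+del[ACGT]+$/) {
          # split on _ . : and the word "del":
          # a[1]=NC a[2]=000013 a[3]=10 a[4]=g a[5]=start a[6]=end a[7]=bases
          split($i, a, /[_.:]|del/)
          # a[2] + 0 converts to a number, dropping the leading zeros
          print a[2] + 0, a[5], a[6], a[7], "-"
          next
        }
      }
    }'
}

# Usage with a shortened copy of the data sample (bold markers removed)
printf '%s\n' \
  'Input Variant  Errors  Chromosomal Variant Coding Variant(s)' \
  'NM_004004.5:c.575_576delCA     NC_000013.10:g.20763145_20763146delTG NM_004004.5:c.575_576delCA' |
  parse_variant
```

Searching every field for the NC_ pattern, rather than hard-coding $2, keeps the sketch working whether or not the Errors column is populated.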

Upvotes: 2
