justaguy
justaguy

Reputation: 3022

awk to print specific fields based on regex match in file

I am trying to use awk to look in input for keywords and in found print specified fields. The awk below does run but does not produce the desired output. What is supposed to happen is that if TYPE=ins or TYPE=del is found in the line then $1,$2,$4,$5, and the LEN= prints. The LEN= is also a field in the line with a number after the =. Thank you :).

input

chr1    1647893 .   C   CTTTCTT 31.9545 PASS    AF=0.330827;AO=179;DP=695;FAO=132;FDP=399;FR=.;FRO=267;FSAF=67;FSAR=65;FSRF=124;FSRR=143;FWDB=0.0145873;FXX=0.00249994;HRUN=1;LEN=6;MLLD=190.481;OALT=TTTCTT;OID=.;OMAPALT=CTTTCTT;OPOS=1647894;OREF=-;PB=0.5;PBP=1;QD=0.320346;RBI=0.0146526;REFB=-0.0116875;REVB=0.00138131;RO=471;SAF=85;SAR=94;SRF=236;SRR=235;SSEN=0;SSEP=0;SSSB=-0.0324817;STB=0.528856;STBP=0.43;TYPE=ins;VARB=0.0222858 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:31:695:399:471:267:179:132:0.330827:94:85:236:235:65:67:124:143
chr1    1650787 .   T   C   483.012 PASS    AF=0.39;AO=181;DP=459;FAO=156;FDP=400;FR=.;FRO=244;FSAF=100;FSAR=56;FSRF=162;FSRR=82;FWDB=-0.00931067;FXX=0;HRUN=1;LEN=1;MLLD=210.04;OALT=C;OID=.;OMAPALT=C;OPOS=1650787;OREF=T;PB=0.5;PBP=1;QD=4.83012;RBI=0.018986;REFB=-0.0114993;REVB=-0.0165463;RO=276;SAF=116;SAR=65;SRF=184;SRR=92;SSEN=0;SSEP=0;SSSB=-0.0305478;STB=0.515311;STBP=0.652;TYPE=snp;VARB=0.019956  GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:483:459:400:276:244:181:156:0.39:65:116:184:92:56:100:162:82
chr1    17034455    .   CGCGCGCGT   C   50  PASS    AF=0.205882;AO=56;DP=272;FR=.;LEN=8;OALT=-;OID=.;OMAPALT=C;OPOS=17034456;OREF=GCGCGCGT;RO=216;SAF=27;SAR=29;SRF=112;SRR=104;TYPE=del    GT:GQ:DP:RO:AO:SAF:SAR:SRF:SRR:AF   0/1:99:272:216:56:27:29:112:104:0.205882

awk

awk '/TYPE=ins/ {print $1,$2,$4,$5, "/TYPE=*/" "/LEN=*/" $0;next} /TYPE=del/ {print $1,$2,$4,$5, "/TYPE=*/" "/LEN=*/" $0;next} 1' input > out

desired output

chr1    1647893 C   CTTTCTT TYPE=ins LEN=6
chr1    17034455 CGCGCGCGT  C TYPE=del LEN=8

Upvotes: 1

Views: 124

Answers (2)

anubhava
anubhava

Reputation: 786289

You can use this awk command:

awk 'function find(str) {
   return substr($0, match($0, str "=[^; \t]+"), RLENGTH);
}
/TYPE=(ins|del)/ {
   print $1, $2, $4, $5, find("TYPE"), find("LEN")
}' file

Output:

chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8

Upvotes: 1

F. Knorr
F. Knorr

Reputation: 3065

Here is an awk-solution:

awk '$0~"TYPE=del" || $0~"TYPE=ins"{max=split($0,ar,";")
                          len=""
                          type=""
                          for(i=1; i<=max; i++){
                              if(ar[i]~"LEN="){len=ar[i]}
                              if(ar[i]~"TYPE="){type=ar[i]}
                          }
                          print $1,$2,$4,$5,type,len}' input

Output:

chr1 1647893 C CTTTCTT TYPE=ins LEN=6
chr1 17034455 CGCGCGCGT C TYPE=del LEN=8

Upvotes: 1

Related Questions