Toml

Reputation: 43

awk script to cut/paste a string in a file

I have a file formatted like this (each space stands for a tab separator):

NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG

I want to cut the :AATGT+GTGTA part and paste it at the end of the line, after a tab separator and prefixed with RX:Z:, to get:

NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA

Important detail: the string after the last ':' of the first field, together with that ':', is what should be cut, regardless of the string's length (it can be AAAA, or AAAA+GGGG, etc.).

I used the following awk script:

awk '/^@/ {print;next} {N=split($1,n,":"); print $0 "\tRX:Z:" n[N] ; sub("[:]"n[N],"") ; print $0}'

My problem is that the original line is still there, so I get this result:

NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA

Basically, I don't know how to redirect the results to a new file (or overwrite the original file) with awk. A bash script would also be a good solution for me. Thanks for your help.

Edit: I forgot to mention that lines beginning with @ must be excluded: the script should not be applied to those lines. (It's a BAM file of NGS data; the header lines should not be changed.)

The file looks like this:

@SQ     SN:chrY LN:59373566
@RG     ID:1    PL:ILLUMINA     PU:PU   LB:001  SM:TeCoriell
@PG     ID:MarkDuplicates       VN:2.23.7       CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.11 
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.11
@PG     ID:GATK PrintReads      VN:3.8-1-0-gf15c1c3ef   CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG     ID:samtools.3   PN:samtools     PP:samtools.2   VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507:CACTC-CCGTC   371     chr1    10257   0       2H48M59H        chr7    128036692       0       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA        AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4>   SA:Z:chr7,128036692,+,76M33S,60,0;      BC:Z:TGCCACCA+GAGCAGCC  MC:Z:76M33H     BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM   MD:Z:36A5A5     PG:Z:MarkDuplicates     RG:
Z:1     BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ   NM:i:2  OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E   AS:i:38 XS:i:38
NB551027:724:HTWHHAFXY:2:11110:2230:8695:AGTCT-AAAGT    163     chr1    15596   0       113M    =       15596   113     CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG  =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/       BC:Z:TGGCACCA+GAGCAGCA  MC:Z:113M       BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ     MD:Z:113        PG:Z:MarkDuplicates     RG:Z:1  BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN     NM:i:0  OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/  AS:i:113        XS:i:113

I should get this result:

@SQ     SN:chrY LN:59373566
@RG     ID:1    PL:ILLUMINA     PU:PU   LB:001  SM:TeCoriell
@PG     ID:MarkDuplicates       VN:2.23.7       CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.11 
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.11
@PG     ID:GATK PrintReads      VN:3.8-1-0-gf15c1c3ef   CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG     ID:samtools.3   PN:samtools     PP:samtools.2   VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507   371     chr1    10257   0       2H48M59H        chr7    128036692       0       ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA        AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4>   SA:Z:chr7,128036692,+,76M33S,60,0;      BC:Z:TGCCACCA+GAGCAGCC  MC:Z:76M33H     BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM   MD:Z:36A5A5     PG:Z:MarkDuplicates     RG:
Z:1     BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ   NM:i:2  OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E   AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695    163     chr1    15596   0       113M    =       15596   113     CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG  =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/       BC:Z:TGGCACCA+GAGCAGCA  MC:Z:113M       BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ     MD:Z:113        PG:Z:MarkDuplicates     RG:Z:1  BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN     NM:i:0  OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/  AS:i:113        XS:i:113 RX:Z:AGTCT-AAAGT

Upvotes: 4

Views: 412

Answers (4)

anubhava

Reputation: 785196

You may use this GNU awk command:

awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '(n=split($1, a, /:/)) > 1 {sub(/:[^:\t]+\t/, OFS); sub(/\n$/, ""); $0 = $0 OFS "RX:Z:" a[n]} {ORS=RT} 1; END {print "\n"}' file

@SQ SN:chrY LN:59373566
@RG ID:1    PL:ILLUMINA PU:PU   LB:001  SM:TeCoriell
@PG ID:MarkDuplicates   VN:2.23.7   CL:MarkDuplicates   BARCODE_TAG=RX  DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam   METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam    PN:MarkDuplicates
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa   mem -C  -M  -t  4   -R  @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz   TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa  VN:1.11
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads  VN:3.8-1-0-gf15c1c3ef   CL:readGroup=null   platform=null   number=-1   sample_file=[]  sample_name=[]  simplify=false  no_pg_tag=false
@PG ID:samtools.2   PN:samtools PP:samtools.1   VN:1.11 CL:samtools sort    -o  TeCoriell.bwamem.bam    -l  5   -T  TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX   -@  4   TeCoriell.bwamem.compress.bam
@PG ID:samtools.3   PN:samtools PP:samtools.2   VN:1.11 CL:samtools view    -h  TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507   371 chr1    10257   0   2H48M59H    chr7    128036692   0   ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA    AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4>   SA:Z:chr7,128036692,+,76M33S,60,0;  BC:Z:TGCCACCA+GAGCAGCC  MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM   MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ   NM:i:2  OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E   AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695    163 chr1    15596   0   113M    =   15596   113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG  =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/   BC:Z:TGGCACCA+GAGCAGCA  MC:Z:113M   BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113    PG:Z:MarkDuplicates RG:Z:1  BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0  OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/  AS:i:113    XS:i:113    RX:Z:AGTCT-AAAGT

A more readable version:

awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '
(n=split($1, a, /:/)) > 1 {
   sub(/:[^:\t]+\t/, OFS)
   sub(/\n$/, "")
   $0 = $0 OFS "RX:Z:" a[n]
}
{
   ORS=RT
}
1;
END {
   print "\n"
}' file

Upvotes: 2

RavinderSingh13

Reputation: 133528

With the samples you have shown, please try the following. Written and tested in GNU awk.

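awk '
/^@/{ print; next }                     # header lines pass through unchanged
match($0,/^[^ \t]*:/){                  # first field, up to and including its last ":"
  part1=substr($0,RSTART,RLENGTH-1)     # read name without the trailing ":"
  part2=substr($0,RSTART+RLENGTH)       # barcode followed by the rest of the line
  match(part2,/[^ \t]*/)                # the barcode itself
  print part1 substr(part2,RSTART+RLENGTH) "\tRX:Z:" substr(part2,RSTART,RLENGTH)
}' Input_file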
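
To see what match leaves in RSTART and RLENGTH, here is a minimal sketch in plain awk. The helper name after_last_colon is made up for illustration, and the input is only the read name from the question. Because the regex is greedy, a pattern that ends in ":" stops at the last colon it can reach, so whatever follows the match is the barcode.

```shell
# Hypothetical helper: prints the text after the last ":" of each input line.
# Lines that contain no ":" produce no output.
after_last_colon() {
  awk 'match($0, /.*:/) {                 # greedy: the match ends at the LAST ":"
    print substr($0, RSTART + RLENGTH)    # everything after the matched part
  }'
}

# Read name taken from the question.
printf 'NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA\n' | after_last_colon
```

On a whole SAM line the same greedy pattern would run on to the last colon of the line, so the match is best limited to the first field.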

Upvotes: 1

The fourth bird

Reputation: 163362

With GNU awk, you might also use match with 3 capture groups.

The third parameter, a, is an array that holds the captured group values, which you can access by index.

When printing, you can reorder the groups and put "\tRX:Z:" between group 3 and group 2.

gawk 'match($0, /(.*):([^:\t]+)(\t.*)/, a) {print a[1]a[3]"\tRX:Z:"a[2]}' file

The pattern matches:

  • (.*): Group 1, everything before the last : that is followed by group 2 and a tab (the : itself is not captured)
  • ([^:\t]+) Group 2, one or more characters other than : or a tab
  • (\t.*) Group 3, a tab and the rest of the line

Output

NB551027:767:H73JMAFX2:1:11101:5356:1093    blabla  LASTTAG RX:Z:AATGT+GTGTA

Upvotes: 1

Ed Morton

Reputation: 203625

$ awk '{x=$1; sub(/.*:/,"",x); sub(/:[^:\t]*\t/,"\t"); print $0 "\tRX:Z:" x}' file
NB551027:767:H73JMAFX2:1:11101:5356:1093        blabla  LASTTAG RX:Z:AATGT+GTGTA
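
For the two remaining points in the question (the duplicated line and writing the result back to a file), here is a minimal sketch in plain awk, assuming tab-separated SAM text. The script in the question calls print twice per line, which is why every read shows up twice; it also builds a regex from the barcode, which is fragile because + is a regex operator. The function name move_barcode_to_rx and the file names in.sam and out.sam are made up for illustration.

```shell
# Hypothetical helper: reads SAM text on stdin, writes the edited text to stdout.
move_barcode_to_rx() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^@/ { print; next }                 # header lines pass through unchanged
    {
      n = split($1, part, ":")           # part[n] is the text after the last ":"
      if (n > 1) {
        sub(/:[^:]*$/, "", $1)           # drop ":BARCODE" from the read name
        $(NF + 1) = "RX:Z:" part[n]      # append the tag as a new last field
      }
      print                              # exactly one print per input line
    }'
}

# Assumed sample file in a scratch directory: one header line and one read.
tmp=$(mktemp -d)
printf '@SQ\tSN:chrY\tLN:59373566\nNB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA\tblabla\tLASTTAG\n' > "$tmp/in.sam"

move_barcode_to_rx < "$tmp/in.sam" > "$tmp/out.sam"   # write to a new file ...
mv "$tmp/out.sam" "$tmp/in.sam"                       # ... then replace the original
cat "$tmp/in.sam"
```

awk cannot safely read and overwrite the same file through a plain > redirection, hence the temporary file followed by mv. GNU awk 4.1 and later also offer -i inplace. For a real BAM file the text would presumably come from samtools view -h and be piped back into samtools view -b to compress it again.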

Upvotes: 2
