Reputation: 43
I have a file formatted like this (each space is a tab separator):
NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
I want to cut the :AATGT+GTGTA part and paste it at the end of the line, with a tab separator, to get:
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
Important detail: I want the string after the last ':' of the first field to be moved (together with the ':'), regardless of its length (it can be AAAA, AAAA+GGGG, etc.).
I used the following awk script:
awk '/^@/ {print;next} {N=split($1,n,":"); print $0 "\tRX:Z:" n[N] ; sub("[:]"n[N],"") ; print $0}'
My problem is that the original line is still there, so I get this result:
NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA blabla LASTTAG
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
Basically, I don't know how to redirect the results to a new file (or overwrite the original file) with awk. A bash script would also be a good solution for me. Thanks for your help.
Edit: I forgot to mention that I have to exclude the lines beginning with @; the script should not be applied to those lines. (It's a BAM file of NGS data; the header lines should not be changed.)
The file looks like this:
@SQ SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507:CACTC-CCGTC 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38
NB551027:724:HTWHHAFXY:2:11110:2230:8695:AGTCT-AAAGT 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113
I should get this result:
@SQ SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113 RX:Z:AGTCT-AAAGT
Upvotes: 4
Views: 412
Reputation: 785196
You may use this gnu-awk:
awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '(n=split($1, a, /:/)) > 1 {sub(/:[^:\t]+\t/, OFS); sub(/\n$/, ""); $0 = $0 OFS "RX:Z:" a[n]} {ORS=RT} 1; END {print "\n"}' file
@SQ SN:chrY LN:59373566
@RG ID:1 PL:ILLUMINA PU:PU LB:001 SM:TeCoriell
@PG ID:MarkDuplicates VN:2.23.7 CL:MarkDuplicates BARCODE_TAG=RX DUPLEX_UMI=true INPUT=[TeCoriell.bwamem.bam OUTPUT=/TeCoriell/TeCoriell.bwamem.compress.recalibration.realignment.bam METRICS_FILE=/TeCoriell/TeCoriell.bwamem.bam PN:MarkDuplicates
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/tools/bwa/current/bin/bwa mem -C -M -t 4 -R @RG\tID:1\tPL:ILLUMINA\tPU:PU\tLB:001\tSM:TeCoriell TeCoriell.R1.fastq.gz TeCoriell.R2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.11
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11
@PG ID:GATK PrintReads VN:3.8-1-0-gf15c1c3ef CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.11 CL:samtools sort -o TeCoriell.bwamem.bam -l 5 -T TeCoriell.bwamem.compress.bam.SAMTOOLS_PREFIX -@ 4 TeCoriell.bwamem.compress.bam
@PG ID:samtools.3 PN:samtools PP:samtools.2 VN:1.11 CL:samtools view -h TeCoriell.bwamem.bam
NB551027:724:HTWHHAFXY:3:21602:20054:7507 371 chr1 10257 0 2H48M59H chr7 128036692 0 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCCATCCCCACCCCCA AEEF>86/>G)-,F><)C;
>/G7D,FF28<.1FFGDAF>F@BEECA4> SA:Z:chr7,128036692,+,76M33S,60,0; BC:Z:TGCCACCA+GAGCAGCC MC:Z:76M33H BD:Z:MJOMLMMJNMLLLJOMLMMIOMLMMIOMLLMJJMOOMJJMNNKKJMMM MD:Z:36A5A5 PG:Z:MarkDuplicates RG:
Z:1 BI:Z:RNQRQRRNQRQRRNQRQRRNQRQRRNQRQRRNNQSSQNNQQRNNNQQQ NM:i:2 OQ:Z:EEEEE<</AE///EA</EAA/EAE/EE6AA//EEEEEEAEAEEEEE/E AS:i:38 XS:i:38 RX:Z:CACTC-CCGTC
NB551027:724:HTWHHAFXY:2:11110:2230:8695 163 chr1 15596 0 113M = 15596 113 CAGGAAGGAGCCATAGCCCAGGCAGGAGGGCTGAGGACCTCTGGTGGCGGCCCAGGGCTTCCAGCATGTGCCCTAGGGGAAGCAGGGGCCA
GCTGGCAAGAGCAGGGGGTGGG =>BB?@F>AGCBAB>GCBCBGFDBGFAGFFDEGAG>@DCEFEFEBGECAECBBAFEECDCEBA@CABFAFCB8D>FEEE@A@CA@DE?BA9E2CE>B@:E??B@?>CDD;DC/ BC:Z:TGGCACCA+GAGCAGCA MC:Z:113M BD:Z:MMOONPOOMONKLNLLMJILNN
JMNNLNNIKMMNNNLMLLLLLMLLLJLJJJILNNJKNMLMMONMNOLMLJJMLMOJJMOMNNOOJJKKMONNLMKNNNOONNNOJJJMMMJ MD:Z:113 PG:Z:MarkDuplicates RG:Z:1 BI:Z:QQRQQQSQQRSPPSQRSPLPRQQSRQQRQNQSSRRQQRQSQRSQRQQQRMQPLPRQNQSRQT
PRSSSSQQSPLSRRQNNQQSSSRQNNQPPRSSSQQSQSRRSSRQNNNRQQN NM:i:0 OQ:Z:EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEEEEEAEEAEEAEEAE6EEAEEAEEAEEAAEEEAEE/ AS:i:113 XS:i:113 RX:Z:AGTCT-AAAGT
A more readable version:
awk -v RS='(^|\n)(@|NB)' -F "[[:space:]]+" -v OFS='\t' '
(n=split($1, a, /:/)) > 1 {
   sub(/:[^:\t]+\t/, OFS)
   sub(/\n$/, "")
   $0 = $0 OFS "RX:Z:" a[n]
}
{
   ORS=RT
}
1;
END {
   print "\n"
}' file
Upvotes: 2
Reputation: 133528
With your shown samples, could you please try the following. Written and tested in GNU awk.
awk '
/^@/{ print; next }
match($0,/.*:/){
  part1=substr($0,RSTART,RLENGTH)
  part2=substr($0,RSTART+RLENGTH)
  match(part2,/[^ ]*/)
  print part1 substr(part2,RSTART+RLENGTH) "\tRX:Z:" substr(part2,RSTART,RLENGTH)
}' Input_file
Upvotes: 1
Reputation: 163362
With gnu awk, you might also use match with 3 capture groups.
The third param a contains the group values which you can access by index.
In the replacement you can reorder them and add "\tRX:Z:" between group 3 and 2.
gawk 'match($0, /(.*):([^:\t]+)(\t.*)/, a) {print a[1]a[3]"\tRX:Z:"a[2]}' file
The pattern matches
(.*): Group 1, match before the last :
([^:\t]+) Group 2, match 1 or more occurrences of any char except : or a tab
(\t.*) Group 3, match a tab and the rest of the line
Output
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
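The three-argument form of match() is a GNU awk extension. Where only a POSIX awk is available, a similar reordering can be sketched with the two-argument match(), which sets RSTART and RLENGTH. This is an illustrative variant rather than the method above: it assumes tab-separated input and only looks inside the first field, and the variable names umi and out are made up for the example.

```shell
# Portable sketch: find the last ":" of field 1 with two-argument match()
out=$(printf 'NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA\tblabla\tLASTTAG\n' |
awk 'BEGIN { FS = OFS = "\t" }
match($1, /:[^:]*$/) {                  # last ":" of field 1 and what follows it
    umi = substr($1, RSTART + 1)        # text after that ":"
    $1  = substr($1, 1, RSTART - 1)     # field 1 without ":<umi>"
    print $0, "RX:Z:" umi               # append the tag as a new last column
}')
printf '%s\n' "$out"
```

Because the search is limited to $1, colons in later columns (for example in tags such as MC:Z:113M) are not considered.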
Upvotes: 1
Reputation: 203625
$ awk '{x=$1; sub(/.*:/,"",x); sub(/:[^:\t]*\t/,"\t"); print $0 "\tRX:Z:" x}' file
NB551027:767:H73JMAFX2:1:11101:5356:1093 blabla LASTTAG RX:Z:AATGT+GTGTA
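Note that the one-liner above rewrites every line, so on the full file the @ header lines would be modified as well. As a rough sketch (not part of the original answer), the variant below passes header lines through unchanged, prints each record exactly once, and redirects the result to a new file. The file names in.sam and out.sam are placeholders, and the input is assumed to be tab-separated.

```shell
# Build a two-line sample; in.sam and out.sam are placeholder names
printf '@SQ\tSN:chrY\tLN:59373566\n' > in.sam
printf 'NB551027:767:H73JMAFX2:1:11101:5356:1093:AATGT+GTGTA\tblabla\tLASTTAG\n' >> in.sam

awk 'BEGIN { FS = OFS = "\t" }
/^@/ { print; next }                # header lines: pass through unchanged
{
    n = split($1, parts, ":")       # parts[n] is the text after the last ":"
    if (n > 1) {
        tag = parts[n]
        sub(/:[^:]*$/, "", $1)      # drop ":<tag>" from the end of field 1
        $0 = $0 OFS "RX:Z:" tag     # append the tag as a new last column
    }
    print                           # a single print, so no duplicated lines
}' in.sam > out.sam

cat out.sam
```

The script in the question calls print twice per record, which is why every line showed up twice. To overwrite the original, the usual approach is to write to a temporary file and rename it afterwards (awk '...' in.sam > tmp.sam && mv tmp.sam in.sam); GNU awk 4.1 and later also offers gawk -i inplace.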
Upvotes: 2