Reputation: 21
I have some fastaq files that I need to analyse. The main issue is that the analysis tool I'm currently working with only accepts A, C, T and G as nucleotides and not the other codes in the IUPAC nomenclature (R, W, etc.).
I've made this code to change the specific nucleotides:
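awk '{
split($2,a,"") ;   # split the sequence (field 2) into single characters
str="" ;
len=length($2) ;
for (n=1; n<=len; n++) {   # indexed loop keeps the characters in order
nucleotide=a[n] ;
if (nucleotide~/[ACTG]/) {str=str""nucleotide}
else {
if (nucleotide~/[RWMV]/) {str=str"A"}
else {
if (nucleotide~/[YD]/) {str=str"C"}
else {
if (nucleotide~/[SKN]/) {str=str"G"}
else {str=str"T"}
}
}
}
}
print str   # emit the converted sequence
}' | head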
It is working but it is super slow. Do you know a more efficient way to do it?
Thank you so much!
Upvotes: 2
Views: 69
Reputation: 8174
Assuming your files are in fastq format, I recommend using a specialized library for this; biopython and bioperl are good options.
cat example.fastq
@ID
AGTCGTACTGGACTGYGCSAACTG
+
IIIIIIIIIIIIIIIIIIIIIIII
@ID2
RWMVYDSKNAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIII
However, here is a solution using awk:
awk 'NR%4==2{gsub(/[RWMV]/,"A"); gsub(/[YD]/,"C"); gsub(/[SKN]/,"G")}1' example.fastq
you get,
@ID
AGTCGTACTGGACTGCGCGAACTG
+
IIIIIIIIIIIIIIIIIIIIIIII
@ID2
AAAACCGGGAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIII
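The code in the question also sends every remaining character (for example B and H) to T in its final else branch. If you want to keep that behaviour, a minimal sketch could look like the one below. The function name iupac_to_acgt is made up for illustration, and it assumes uppercase sequences and strict four-line fastq records (no wrapped sequence lines).

```shell
# Hypothetical helper: rewrite only the sequence line of each 4-line fastq record.
iupac_to_acgt() {
  awk 'NR % 4 == 2 {
         gsub(/[RWMV]/, "A")
         gsub(/[YD]/, "C")
         gsub(/[SKN]/, "G")
         gsub(/[^ACGT]/, "T")   # fallback: anything still not A, C, G or T becomes T
       }
       { print }' "$@"
}

# Usage: read from stdin, or pass file names as arguments.
printf '@ID3\nACGTBHN\n+\nIIIIIII\n' | iupac_to_acgt
```

Each gsub scans the whole line once, so this stays a handful of passes per sequence instead of one regex test per character. Note that lowercase bases would also be turned into T by the fallback, so uppercase the sequences first if your files contain soft-masked bases.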
Upvotes: 3