Anna Delgado

Reputation: 21

Substituting specific nucleotides in FastaQ files in Linux

I have some fastaq files that I need to analyse. The main issue is that the analysis tool I'm currently working with only accepts A, C, T and G as nucleotides, not the other codes in the IUPAC nomenclature (R, W, etc.).

I wrote this code to replace those specific nucleotides:

awk '{
    len = split($2, a, "")    # split the sequence field into single characters
    str = ""
    for (n = 1; n <= len; n++) {
        nucleotide = a[n]
        if (nucleotide ~ /[ACTG]/) str = str nucleotide
        else if (nucleotide ~ /[RWMV]/) str = str "A"
        else if (nucleotide ~ /[YD]/) str = str "C"
        else if (nucleotide ~ /[SKN]/) str = str "G"
        else str = str "T"    # every other code becomes T
    }
    print str
}' | head

It is working but it is super slow. Do you know a more efficient way to do it?

Thank you so much!

Upvotes: 2

Views: 69

Answers (1)

Jose Ricardo Bustos M.

Reputation: 8174

Assuming your files are in fastq format, I recommend using a specialized library; Biopython and BioPerl are good options.

cat example.fastq

@ID
AGTCGTACTGGACTGYGCSAACTG
+
IIIIIIIIIIIIIIIIIIIIIIII
@ID2
RWMVYDSKNAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIII

However, here is a solution using awk:

awk 'NR%4==2{gsub(/[RWMV]/,"A"); gsub(/[YD]/,"C"); gsub(/[SKN]/,"G")}1' example.fastq

which gives:

@ID
AGTCGTACTGGACTGCGCGAACTG
+
IIIIIIIIIIIIIIIIIIIIIIII
@ID2
AAAACCGGGAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIII
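
Note that the script in the question maps every remaining code to T, while the one-liner above leaves the last two IUPAC ambiguity codes (B and H) untouched. A minimal sketch of a variant that also handles them, assuming a strict four-lines-per-record fastq layout and uppercase sequences; the function name normalize_fastq is hypothetical:

```shell
# Hypothetical helper: replace IUPAC ambiguity codes on fastq sequence lines.
# Assumes four lines per record, so the sequence is always line 2 of a record.
normalize_fastq() {
    awk 'NR % 4 == 2 {
        gsub(/[RWMV]/, "A")   # same mapping as in the question
        gsub(/[YD]/,   "C")
        gsub(/[SKN]/,  "G")
        gsub(/[BH]/,   "T")   # remaining IUPAC codes, mapped to T as in the question
    }
    { print }' "$@"
}

# Example use (file names are placeholders):
# normalize_fastq example.fastq > example.clean.fastq
```

Because only lines with NR % 4 == 2 are modified, header and quality lines pass through unchanged even if they contain the same letters.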

Upvotes: 3
