Reputation: 727

Retrieve the coding amino-acid when there is a certain pattern in a DNA sequence

I would like to retrieve the coding amino-acid when there is a certain pattern in a DNA sequence. For example, the pattern could be ATAGTA. So, given:

Input file:

>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC

The ideal output would be a table giving, for each amino-acid, the number of times it is coded by the pattern. Here, in sequence1 the pattern codes for only one amino-acid, but in sequence2 it codes for two. I would like this tool to scale to thousands of sequences. I've been thinking about how to get this done, but the only approach I came up with is: replace all nucleotides outside the pattern, translate what remains, and summarise the coded amino-acids.

Please let me know if this task can be performed by an already available tool.

Thanks for your help. All the best, Bernardo

Edit (due to the confusion generated with my post):

Please forget the original post and sequence1 and sequence2 too.

Hi all, and sorry for the confusion. The input FASTA file is a *.ffn file derived from a GenBank file using the 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so I can assume the sequences are already in frame (+1) and there is no need to get amino-acids coded in any frame other than +1.

I would like to know which amino-acids the following sequences code for:

AGAGAG
GAGAGA
CTCTCT
TCTCTC

The only strings for which I want the coded amino-acids are exactly three repeats of AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coded amino-acids for runs of four or more repeats.

Thanks again, Bernardo

Upvotes: 1

Answers (2)

Steve

Reputation: 54392

Here's some code that should at least get you started. For example, you can run it like this:

./retrieve_coding_aa.pl file.fa ATAGTA

Contents of retrieve_coding_aa.pl:

#!/usr/bin/perl

use strict;
use warnings;

use Bio::SeqIO;
use Bio::Tools::CodonTable;

my ($file, $pattern) = @ARGV;

my $fasta = Bio::SeqIO->new( -file => $file, -format => 'fasta' );

# Standard genetic code; build the table once rather than once per match.
my $table = Bio::Tools::CodonTable->new();

while ( my $seq = $fasta->next_seq ) {

    my $pos = 0;    # 0-based offset of the current chunk in the sequence

    my %counts;

    # The capturing group makes split return the matches as well.
    for my $chunk ( split /(\Q$pattern\E)/, $seq->seq ) {

        # Record the length before the chunk is trimmed, so that $pos
        # stays correct for later matches in the same sequence.
        my $len = length $chunk;

        if ( $chunk eq $pattern ) {

            # Keep only the complete codons in frame +1.
            my $dist = $pos % 3;

            substr( $chunk, 0, 3 - $dist, '' ) if $dist;

            chop $chunk while length($chunk) % 3;

            $counts{$_}++ for split //, $table->translate($chunk);
        }

        $pos += $len;
    }

    print $seq->display_id, ":\n";

    print "$_ => $counts{$_}\n"
        for sort { $counts{$a} <=> $counts{$b} } keys %counts;

    print "\n";
}

Here are the results using the sample input:

sequence1:
S => 1

sequence2:
V => 1
I => 1
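
The frame handling is the part that is easy to get wrong, so here is a minimal sketch of the same idea in Python (not the Perl above, and without BioPerl). It assumes uppercase A/C/G/T input, frame +1 starting at the first base, and the standard genetic code; the function name coded_amino_acids is made up for this illustration. Only codons that lie completely inside a match of the pattern are counted.

```python
import re
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def coded_amino_acids(seq, pattern):
    """Count amino acids whose codon (frame +1) lies fully inside a match."""
    counts = Counter()
    for m in re.finditer(re.escape(pattern), seq):
        start = m.start() + (-m.start()) % 3  # first codon boundary inside the match
        end = m.end() - m.end() % 3           # last codon boundary inside the match
        for i in range(start, end, 3):
            # Unknown codons (e.g. containing N) are reported as X.
            counts[CODON_TABLE.get(seq[i:i + 3], "X")] += 1
    return counts


print(coded_amino_acids("ATGGCGCATAGTAATGC", "ATAGTA"))
print(coded_amino_acids("ATGATAGTAATGCGCGC", "ATAGTA"))
```

Like split in the Perl script, re.finditer only reports non-overlapping matches.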

The Bio::Tools::CodonTable class also supports non-standard codon tables. You can select a different table with the -id argument or the id method. For example:

$table = Bio::Tools::CodonTable->new( -id => 5 );

or:

$table->id(5);

For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
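
For the edited question (exactly three repeats of AG, GA, CT or TC, not four or more), one possible reading can be sketched in Python with lookarounds. This is only an illustration under assumptions: the helper names are invented, the standard genetic code is used, and "exactly three" is taken to mean that the same two-base unit does not occur immediately before or after the match. Under that reading a run such as (GA)4 still contains an isolated AGAGAG, so adjust the rule if that is not what you want.

```python
import re
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def exact_repeat_regex(unit, n=3):
    """Regex for exactly n copies of unit, not flanked by another copy."""
    u = re.escape(unit)
    return re.compile(f"(?<!{u})(?:{u}){{{n}}}(?!{u})")


def exact_repeat_starts(seq, unit, n=3):
    """0-based start positions of runs of exactly n copies of unit."""
    return [m.start() for m in exact_repeat_regex(unit, n).finditer(seq)]


def translate(dna):
    """Translate complete codons from the first base; unknown codons give X."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


for hexamer in ("AGAGAG", "GAGAGA", "CTCTCT", "TCTCTC"):
    print(hexamer, translate(hexamer))
```

The loop at the end translates each hexamer as if it started on a codon boundary; a repeat that starts at another position in the gene covers different codons, so the start positions still have to be checked against the frame.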

Upvotes: 1

Robbert Koppenol

Reputation: 103

I will stick to the first version of what you wanted, because the addendum only confused me even more (frame?). I only found ATAGTA once in sequence2, but I assume you want the reversed sequence as well, which would be ATGATA in this case. My script doesn't do that, so you would have to list both in the input_seq.txt file, but that should be no problem.

I work with a file like yours, which I call "dna.txt", and an input sequences file called "input_seq.txt". The result file lists each pattern and its number of occurrences in dna.txt (overlapping matches are counted, but it can be set to non-overlapping as explained in the awk code).

input_seq.txt:

GC
ATA
ATAGTA
ATGATA

dna.txt:

>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC

results.txt:

GC,6
ATA,2
ATAGTA,2
ATGATA,1

The code is one awk script calling another (but one of them is simple). Run "./match_patterns.awk input_seq.txt" to generate the results file:

match_patterns.awk:

#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}

test.awk:

#! /bin/awk -f
{
    string = $0
    do {
        match(string, subval)
        # Counts overlapping matches (i.e. ATA matches twice in ATATAC).
        # For non-overlapping matches replace +1 by +RLENGTH in the following line.
        if (RSTART != 0) { count++; string = substr(string, RSTART + 1) }
    } while (RSTART != 0)
}
# count + 0 prints 0 instead of an empty field when nothing matched.
END { print subval "," (count + 0) >> "results.txt" }

Files have to be all in the same directory.
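
If the overlapping versus non-overlapping switch in test.awk is unclear, the same counting loop can be sketched in Python. This is an illustration only, not part of the awk scripts: the function name count_occurrences is made up, and it searches for a literal string, whereas awk's match() treats subval as a regular expression.

```python
def count_occurrences(text, pattern, overlapping=True):
    """Count literal occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    # Step 1 past a hit to allow overlaps, or past the whole hit to forbid them.
    step = 1 if overlapping else len(pattern)
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + step)
    return count


lines = ["ATGGCGCATAGTAATGC", "ATGATAGTAATGCGCGC"]
for pattern in ("GC", "ATA", "ATAGTA", "ATGATA"):
    print(pattern, sum(count_occurrences(line, pattern) for line in lines))
```

Stepping by one character corresponds to the +1 in test.awk, and stepping by the pattern length corresponds to +RLENGTH.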

Good luck!

Upvotes: 0
