kajendiran
kajendiran

Reputation: 39

Compute all possible sequences up to a given length

My question relates to bioinformatics, specifically protein sequences however, no biological knowledge is really needed. I am trying to find an efficient way of solving this problem in Perl:

Protein sequences are basically sequences or strings which vary in length and are composed of combinations of 20 amino acids or characters.

At a length of 1, there would be thus 20 possibilities. The issue is that with every increment of 1 character, the number of possibilities increases substantially.

I wanted to compute another calculation on every sequence of every length. Protein sequences can be many hundreds and even thousands of amino acids. I just need to get all the possible sequences to do this.

edit: I realise that it is impossible to compute for every length, I do not need to do this, but I wanted to do it for a sensible length that would not take anywhere near the length of the universe.

Any suggestions on the most efficient way to code this?

edit: I do not really need to do this for sequences of 1000, I was just interested on ideas, resources, functions etc that I am not aware of that may help me understand the most efficient way to do this.

Upvotes: 1

Views: 1882

Answers (4)

Borodin
Borodin

Reputation: 126722

The Math::Combinatorics module that has been recommended doesn't support permutations with replacement, which is what you want for this problem as otherwise your proteins can never be longer than twenty amino-acids.

Algorithm::Combinatorics will do the job, and is written partially in C so it should perform well.

Here is an example that generates all pairs of amino-acids. I have shown only the first few lines of output as even this produces 400 variations!

use strict;
use warnings;

use Algorithm::Combinatorics 'variations_with_repetition';

my @acids = qw/ ala arg asn asp cys gln glu gly his ile leu lys met phe pro ser thr trp tyr val /;

my @proteins = variations_with_repetition(\@acids, 2);

print "@$_\n" for @proteins;

output

ala ala
ala arg
ala asn
ala asp
ala cys
ala gln
ala glu
ala gly
ala his
ala ile
ala leu
ala lys
ala met
ala phe
ala pro
ala ser
ala thr
ala trp
ala tyr
ala val
arg ala
arg arg
arg asn
arg asp
arg cys
arg gln
arg glu
arg gly
...

Upvotes: 4

Thor
Thor

Reputation: 47099

To generate permutations in perl I usually turn to Math::Combinatorics, here's a program snippet that returns all permutations of 1, 2, 3, one at a time:

#!/usr/bin/perl -l

use Math::Combinatorics;

$, = " ";

@n = (1 .. 3);
$permuter = Math::Combinatorics->new(data => \@n);

while(@perm = $permuter->next_permutation())
{
  print @perm;
}

Output:

1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1

But heed the advice from the other answers, this is an exponentially growing problem as it is stated, so you need some way of limiting your search space.

Upvotes: -1

abought
abought

Reputation: 2680

Given that your phrasing involves every sequence of every known length, this problem would never converge to a reasonable result- you'd keep going to a length of infinity. Furthermore, your calculation would include a lot of sequences with no relation to reality, or comparisons between dipeptides and gigundous molecules. Even if you limited your calculation to the length of the largest known protein (titin, ~34,350 amino acids), it would still be a prohibitively expensive calculation.

As an alternate proposal: have you considered limiting it to proteins that are actually known to exist, or could be predicted from genetic databases? This would reduce the amount of work to a few thousand sequences of biological relevance, and for most bioinformatics applications genetic or sequence data is widely available from well-structured databases.

Upvotes: 2

Emil Vikström
Emil Vikström

Reputation: 91922

20^1000 is a really large number. You say you need to do some calculation for every sequence, which is not really possible without scaling out to multiple computers. Even at one million calculations a second it would take you many times the age of the universe to finish your calculations.

Upvotes: 3

Related Questions