Reputation: 2319
I'm trying to find the longest common substring (LCS) across multiple sequences.
There are a number of modules on CPAN that implement the LCS algorithm for 2 sequences, such as Algorithm::Diff and String::LCSS_XS, but I'm having a hard time in extending them to work with more than 2 sequences because the LCS across multiple sequences is not necessarily a LCS between any two of them.
Of note, despite its name, the Algorith::MLCS doesn't actually return the LCS but all the common elements (also non-consecutive) of a number of arrays. My impression is that it is broken by design, but I might be wrong.
Algorithm::Diff and Algorith::MLCS solve the longest common subsequence problem, not the longest common substring one.
Is there an obvious way to extend the n=2 algorithms or do I have to implement my version? If yes, how?
Thanks.
Upvotes: 1
Views: 1055
Reputation: 2319
This can be solved pretty easily using the Tree::Suffix module.
Example:
#!/usr/bin/env perl
use Modern::Perl;
use Bio::SeqIO;
use Tree::Suffix;
my $seqio = Bio::SeqIO->new(
-file => "fasta_sequences.txt",
-format => "Fasta");
my @seqs;
while (my $seqobj = $seqio->next_seq) {
push @seqs, $seqobj->seq;
}
my $tree = Tree::Suffix->new(@seqs);
my @lcss = $tree->lcs;
say $_ for @lcss;
Upvotes: 1