Borys Ostapienko
Borys Ostapienko

Reputation: 79

Find longest repeating string based on input in perl (using subroutines)

So I'm trying to find the longest repeat for a specific pattern thats given. My code so far looks like this, and is fairly close, however it does not fully give the wanted result:

use warnings;
use strict;    

my $DNA;       
$DNA = "ATATCCCACTGTAGATAGATAGAATATATATATATCCCAGCT" ;
print "$DNA\n" ;
print "The longest AT repeat is " . longestRepeat($DNA, "AT") . "\n" ;
print "The longest TAGA repeat is " . longestRepeat($DNA, "TAGA") . "\n" ;
print "The longest C repeat is " . longestRepeat($DNA, "C") . "\n" ;

sub longestRepeat{

  my $someSequence = shift(@_);  # shift off the first  argument from the list
  my $whatBP       = shift(@_);  # shift off the second argument from the list
  my $match = 0;



        if ($whatBP eq "AT"){
            while ($someSequence =~ m/$whatBP/g) {
            $match = $match + 1;
            }
            return $match;

        }
        if ($whatBP eq "TAGA"){
            while ($someSequence =~ m/$whatBP/g) {
            $match = $match + 1;
            }
            return $match;
        }

        if ($whatBP eq "C"){
            while ($someSequence =~ m/$whatBP/g) {
            $match = $match + 1;
            }
            return $match;
        }
}   

What its doing right now is just finding the amount of TOTAL AT's, TAGA's, C's in the sequence. Instead of only giving me the length of the longest one, it sums them up and gives me the total. I think there is something wrong in the while loop, however I am unsure of that. Any help would be quite appreciated.

p.s. It also should display the longest repeat in string form, not number form (probably use of substr here).

Upvotes: 0

Views: 651

Answers (2)

Zaid
Zaid

Reputation: 37146

(pasted this from a duplicate of this question)

Based on the name of your subroutine, I'm assuming that you want to find the longest repeat sequence in your sequence.

If so, how about the following:

sub longest_repeat {

    my ( $sequence, $what ) = @_;

    my @matches = $sequence =~ /((?:$what)+)/g ;  # Store all matches

    my $longest;
    foreach my $match ( @matches ) {  # Could also avoid temp variable :
                                      # for my $match ( $sequence =~ /((?:$what)+)/g )

        $longest //= $match ;         # Initialize
                                      #  (could also do `$longest = $match
                                      #                    unless defined $match`)

        $longest = $match if length( $longest ) < length( $match );
    }

    return $longest;  # Note this also handles the case of no matches
}

If you can digest that, the following version achieves essentially the same functionality with a Schwartzian transform:

sub longest_repeat {

    my ( $sequence, $what ) = @_;                          # Example:
                                                           # --------------------
    my ( $longest ) = map { $_->[0] }                      # 'ATAT' ...
                        sort { $b->[1] <=> $a->[1] }       # ['ATAT',4], ['AT',2]
                          map { [ $_, length($_) ] }       # ['AT',2], ['ATAT',4]
                            $sequence =~ /((?:$what)+)/g ; # ... 'AT', 'ATAT'

    return $longest ;
}

Some may argue that it is wasteful to sort because it is O(n.log(n)) instead of O(n) but there's variety for ya.

Upvotes: 0

Aaron Miller
Aaron Miller

Reputation: 3780

There's no need for your longestRepeat function to check which of the three cases it's handling -- in general, when you find you've written exactly the same instructions multiple times, it's a hint that you can factor out the repetition and simplify your program thereby. Consider the following, which I've cleaned up for functionality and commented for illustrative purposes:

#!/usr/bin/env perl
use warnings;
use strict;    

# no need to declare and define separately; this works fine
# also no need for space before semicolon
my $DNA = "ATATCCCACTGTAGATAGATAGAATATATATATATCCCAGCT";
print "$DNA\n";
print "The longest AT repeat is " . longestRepeat($DNA, "AT") . "\n";
print "The longest TAGA repeat is " . longestRepeat($DNA, "TAGA") . "\n";
print "The longest C repeat is " . longestRepeat($DNA, "C") . "\n";

sub longestRepeat {

  # note that, within a function, @_ is the default argument to shift();
  # hence its absence in the next two lines. (in practice, you're more 
  # likely to see 'shift' in this context without even parentheses, much
  # less the full 'shift(@_)'; be prepared to run into it.)
  my $sequence = shift(); # take the first argument
  my $kmer = shift(); # take the second argument

  # these state variables we'll use to keep track of what we're doing here;
  # $longest_match, a string, will eventually be returned.
  my $longest_matchlen = 0;
  my $longest_match = '';

  # for each match in $sequence of one or more $kmer repeats...
  while ($sequence =~ m@($kmer)+@g) {

    # ...get the length of the match, stored in $1 by the parenthesized
    # capture group, with the '+' quantifier grabbing the longest match 
    # available from each starting point (see `man perlre' for more)...
    my $this_matchlen = length($1);

    # ...and if this match is longer than the longest yet found...
    if ($this_matchlen > $longest_matchlen) {

      # ...store this match's length in $longest_matchlen...
      $longest_matchlen = $this_matchlen;

      # ...and store the match itself in $longest_match.
      $longest_match = $1;

    }; # end of the 'if' statement

  }; # end of the 'while' loop

  # at this point, the longest match we found is in $longest_match; if
  # we found no matches, then $longest_match still contains the empty
  # string we assigned up there before the while loop started, which is
  # the correct result in a case where $kmer never appears in $sequence.
  return $longest_match;
};

You're studying bioinformatics, aren't you? I have some experience of teaching Perl to bioinformaticians, and I gather there is an extremely broad distribution of programming skill and talent in the field, with a rather unfortunate hump toward the left-hand side of the graph -- a polite way of saying that, as a professional programmer, most of the bioinformatics Perl code I've seen has ranged from not very good to quite poor indeed.

I mention this with no intent to give insult, but only to substantiate my very strong recommendation that you include some computer science courses in whatever curriculum you're currently pursuing; the more exposure you can give yourself to the general concepts and habits of thinking which are involved in the accurate formulation of algorithms, the more prepared you'll be to tackle the requirements imposed by your field -- indeed, more prepared than most, in my experience; while I'm not a bioinformatician myself, in working with people who are, it seems to me that a strong programming background may well be more useful to a bioinformatician than a strong background in biology.

Upvotes: 2

Related Questions