count the number of repeats of a set of characters

Question

I have a .fa file with the following sequences:

NP_009339.1 NP_009339.1 glutamate dehydrogenase (NADP(+)) GDH3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXXX
XXBBBBBBBBBBBBBBBBBXXXXXBBBBBBBBBBBBBBBBBBBBBBBBBBBBBXXX XX

gi|10383797|ref|NP_009965.2| Rbk1p [Saccharomyces cerevisiae S288c]
AAAAAAAAAAAAAAAAAAAAAAAXXXXAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXXX
XBBBBBBBBBBBBBBBBBBBXX XXBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Note that at the end of the first line of the first sequence the XXXXX stretch is split by a line break, and in the second line an XXXXX stretch is split by white space; I want to count those too. Could anyone help me count the XXXXX stretches in this file and print the entire sequence to output.fa? I have exhausted myself struggling with "chomp" to ignore the line breaks and whitespace.

Here is my script:

#!/usr/bin/perl
use warnings;
use strict;    
open my $fh , '<' , 'input.fa' or die 'Cannot open file';
my $Count_XXXXX=0;
while (<$fh>){
chomp;
$Count_XXXXX+=s/X{5}//g;
}
close $fh;
print "\nTotal no of repeats:".$Count_XXXXX."\n";

Miller · Accepted Answer

The easiest approach is to strip out all whitespace before matching, if that's what you want. The following reads your sequences one blank-line-separated record at a time (paragraph mode) and then processes the $data:

use strict;
use warnings;

local $/ = "\n\n";

while (<DATA>) {
    chomp;
    my ($label, $data) = split "\n", $_, 2;
    $data =~ s/\s+//g;

    my $count = () = $data =~ m/X{5,}/g;

    print "$count\n";
}

__DATA__
NP_009339.1 NP_009339.1 glutamate dehydrogenase (NADP(+)) GDH3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXXX
XXBBBBBBBBBBBBBBBBBXXXXXBBBBBBBBBBBBBBBBBBBBBBBBBBBBBXXX XX

gi|10383797|ref|NP_009965.2| Rbk1p [Saccharomyces cerevisiae S288c]
AAAAAAAAAAAAAAAAAAAAAAAXXXXAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXXX
XBBBBBBBBBBBBBBBBBBBXX XXBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Outputs:

3
0
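
The question also asks for a running total and for the sequences to be written out. A minimal sketch of one way to do that is below; it is not part of the original answer. The helper name count_and_write and the sample data are hypothetical, and it sets $/ to the empty string (Perl's built-in paragraph mode) instead of "\n\n" so that several consecutive blank lines still separate records. In-memory filehandles stand in for input.fa and output.fa so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: reads blank-line-separated records from $in,
# writes each label plus its whitespace-free sequence to $out,
# and returns the total number of runs of 5 or more X's.
sub count_and_write {
    my ($in, $out) = @_;
    local $/ = "";    # paragraph mode: blank lines end a record
    my $total = 0;

    while (my $record = <$in>) {
        chomp $record;    # in paragraph mode this drops all trailing newlines
        my ($label, $data) = split /\n/, $record, 2;
        next unless defined $data;    # skip a label with no sequence

        $data =~ s/\s+//g;            # join lines and remove stray spaces
        my $count = () = $data =~ /X{5,}/g;
        $total += $count;

        print {$out} "$label\n$data\n\n";
    }
    return $total;
}

# Made-up sample data; with real files, open 'input.fa' for reading
# and 'output.fa' for writing instead of these in-memory handles.
my $sample  = "seq1 demo\nAAAXXX\nXXBBBXXX XXCC\n\nseq2 demo\nAAXXXXBB\n";
my $written = '';
open my $in,  '<', \$sample  or die "Cannot open in-memory input: $!";
open my $out, '>', \$written or die "Cannot open in-memory output: $!";

my $total = count_and_write($in, $out);
close $in;
close $out;

print "Total no of repeats: $total\n";
```

For this sample the first record contains two runs (one split by a line break, one by a space) and the second contains none, so the script prints a total of 2.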

Update

To capture all of the matches, just assign the result of the regular expression to an array:

my $count = my @matches = $data =~ m/X{5,}/g;

Note, I intentionally made the match pull 5 or more X's, because I assumed that 10 X's in a row should be counted as a single match and not 2 matches.
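
To make the difference concrete, here is a small sketch comparing the two patterns on a run of ten X's. The helper count_matches and the test string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: number of non-overlapping matches of $pattern in $string
sub count_matches {
    my ($string, $pattern) = @_;
    my $count = () = $string =~ /$pattern/g;
    return $count;
}

# Ten X's in a row, flanked by other characters
my $run = 'AA' . ('X' x 10) . 'BB';

my $exact    = count_matches($run, qr/X{5}/);     # matches two adjacent blocks of 5
my $at_least = count_matches($run, qr/X{5,}/);    # matches the whole run once

print "X{5}: $exact, X{5,}: $at_least\n";
```

This prints "X{5}: 2, X{5,}: 1", which is why the original s/X{5}//g in the question would count a long stretch more than once.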
