Reputation: 23
I have several files, each with several lines of text, and I want to find a specific sequence of 7 letters and count how many times it appears per file, using basic regular expressions in Perl (v5.24.1).
So far no problem, but the "tricky" part is that if exactly one of those seven letters doesn't match my pattern, I would like to count that occurrence as well.
Patterns I'm looking for:
'CCCAGGT', 'CCCAGTT', 'CCCAGCT', and 'CCCAGAT'.
Examples of non-matching text that I would like to count:
line1 - AGGCTCAGGAG'CCCATGT'GGGCGGACCCAT --> Count as 'CCCAGGT'
line2 - CGGCTCAGGAG'CCCGGGT'GGGCGGTCCCAT --> Count as 'CCCAGGT'
I'm including a piece of code (at the bottom) to further explain what I'm searching for and what I've tried so far, but there has to be a better way of doing this.
So, is it possible to "tell" Perl to tolerate 1 mismatch in the sequence when using the =~ m// operator? Or is there another function I could use?
Thanks a lot for your help!
if ($elements[0] =~ m/CCCAGGT/){
    $mutg = $mutg + $elements[1];
}
elsif ($elements[0] =~ m/CCCAGTT/){
    $mutt = $mutt + $elements[1];
}
elsif ($elements[0] =~ m/CCCAGAT/){
    $muta = $muta + $elements[1];
}
elsif ($elements[0] =~ m/CCCAGCT/){
    $mutc = $mutc + $elements[1];
}
else {
    if ($elements[0] =~ m/.CCAGGT/){
        $mutg = $mutg + $elements[1];
    }
    elsif ($elements[0] =~ m/.CCAGTT/){
        $mutt = $mutt + $elements[1];
    }
    elsif ($elements[0] =~ m/.CCAGAT/){
        $muta = $muta + $elements[1];
    }
    elsif ($elements[0] =~ m/.CCAGCT/){
        $mutc = $mutc + $elements[1];
    }
    else {
        [[AGAIN BUT WITH THE "." IN THE SECOND POSITION AND SO ON]]
    }
}
Upvotes: 2
Views: 271
Reputation: 385847
To simply check if there's a match (as opposed to finding the nearest match), we could build the patterns (.CCAGGT, C.CAGGT, etc.) dynamically.
my $target_seq = "CCCAGGT";
my @parts = map quotemeta, split //, $target_seq;
my $fuzzy_pat =
    join "|",
        map { join("", @parts[0..$_-1], ".", @parts[$_+1..$#parts]) }
            0..$#parts;
my $fuzzy_re = qr/$fuzzy_pat/;
This can be extended to check for multiple sequences at once, as long as one doesn't care which sequence is found.
use List::Util qw( uniq );   # uniq requires List::Util 1.45 or later
my @target_seqs = qw( CCCAGGT CCCAGTT CCCAGAT CCCAGCT );
my @fuzzy_pats;
for my $pat (@target_seqs) {
    my @parts = map quotemeta, split //, $pat;
    for my $i (0..$#parts) {
        push @fuzzy_pats, join("", @parts[0..$i-1], ".", @parts[$i+1..$#parts]);
    }
}
# An explicit comparison block is needed here: "sort uniq @fuzzy_pats"
# would be parsed as sorting @fuzzy_pats with uniq as the comparison sub.
my $fuzzy_pat = join "|", sort { $a cmp $b } uniq @fuzzy_pats;
my $fuzzy_re = qr/$fuzzy_pat/;
$mtg += $elements[1] if $elements[0] =~ $fuzzy_re;
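If it does matter which sequence was found, one option is to keep a separate fuzzy regex per target and try exact matches before fuzzy ones, mirroring the if/elsif chain in the question. The following is only a minimal sketch: the helper names build_fuzzy_re and classify are hypothetical, not part of any module. Note that a mismatch in the sixth position (CCCAG.T) is compatible with all four targets, so the order of @target_seqs decides such ties.

```perl
use strict;
use warnings;

my @target_seqs = qw( CCCAGGT CCCAGTT CCCAGAT CCCAGCT );

# Hypothetical helper: regex matching $seq with any single position wildcarded.
sub build_fuzzy_re {
    my ($seq) = @_;
    my @parts = map quotemeta, split //, $seq;
    my $pat = join "|",
        map { join("", @parts[0..$_-1], ".", @parts[$_+1..$#parts]) }
        0..$#parts;
    return qr/$pat/;
}

my %fuzzy_re = map { $_ => build_fuzzy_re($_) } @target_seqs;

# Hypothetical helper: returns the first target found in $text
# (exactly, or with one mismatch), or an empty list if none is found.
sub classify {
    my ($text) = @_;
    for my $seq (@target_seqs) {    # exact matches take priority
        return $seq if index($text, $seq) >= 0;
    }
    for my $seq (@target_seqs) {    # then allow a single mismatch
        return $seq if $text =~ $fuzzy_re{$seq};
    }
    return;
}

# The two example lines from the question (quotes removed).
print classify("AGGCTCAGGAGCCCATGTGGGCGGACCCAT"), "\n";   # CCCAGGT
print classify("CGGCTCAGGAGCCCGGGTGGGCGGTCCCAT"), "\n";   # CCCAGGT
```

The returned name could then be used as a hash key for the per-sequence counters instead of separate $mutg, $mutt, $muta and $mutc variables.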
Upvotes: 3
Reputation: 9231
It may be possible with a regex, but it will be overly complicated, because regexes are not designed for fuzzy matching. You may want to consider Text::Fuzzy instead. Its normal interface requires first turning your lines into a list of candidate sequences to compare against.
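use strict;
use warnings;
use Text::Fuzzy;
# Example line from the question; in practice this comes from the input file.
my $line = 'AGGCTCAGGAGCCCATGTGGGCGGACCCAT';
# Candidate sequences: every 7-letter window of the line.
my @possible = map { substr($line, $_, 7) } 0 .. length($line) - 7;
my $fuzzy = Text::Fuzzy->new('CCCAGGT', max => 1);
my @matches = $fuzzy->nearestv(\@possible);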
The fuzzy_index function may also be useful for searching a larger text string similar to regex, but only returns the closest match within the string.
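For equal-length sequences where only substitutions count, the same idea can be sketched without a module: slide a 7-letter window over the line and count the differing positions (the Hamming distance). This is a swapped-in technique, not what Text::Fuzzy does internally, and the helper names hamming and first_window are hypothetical. Unlike a closest-match search, it stops at the first window within the allowed number of mismatches.

```perl
use strict;
use warnings;

# Number of positions at which two equal-length strings differ.
sub hamming {
    my ($s, $t) = @_;
    # String XOR yields "\0" wherever the characters agree;
    # tr with /c counts the bytes that are not "\0".
    return scalar( ($s ^ $t) =~ tr/\0//c );
}

# Returns (offset, window, distance) for the first window of $line that is
# within $max mismatches of $target, or an empty list if there is none.
sub first_window {
    my ($line, $target, $max) = @_;
    my $len = length $target;
    for my $pos (0 .. length($line) - $len) {
        my $window = substr $line, $pos, $len;
        my $dist   = hamming($window, $target);
        return ($pos, $window, $dist) if $dist <= $max;
    }
    return;
}

# First example line from the question (quotes removed).
my ($pos, $window, $dist) =
    first_window("AGGCTCAGGAGCCCATGTGGGCGGACCCAT", "CCCAGGT", 1);
print "$pos $window $dist\n";   # 11 CCCATGT 1
```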
Upvotes: 4