Is there a way to "allow" a Perl regex to ignore 1 character at a time when trying to match?

Question

Basically, I have several files, each with several lines of text, and I want to find a specific sequence of 7 letters and count how many times it appears per file, using some basic regular expressions in Perl (v5.24.1).

So far no problem, but the "tricky" part is that if exactly one of those seven letters doesn't match my pattern, I would like to count that occurrence as well.

Patterns I'm looking for: 'CCCAGGT', 'CCCAGTT', 'CCCAGCT', and 'CCCAGAT'.

Examples of non-matching text that I would like to count:

line1 - AGGCTCAGGAG'CCCATGT'GGGCGGACCCAT --> Count as 'CCCAGGT'
line2 - CGGCTCAGGAG'CCCGGGT'GGGCGGTCCCAT --> Count as 'CCCAGGT'

I'm including a piece of code (at the bottom) to further explain what I'm searching for and what I've tried so far, but there has to be a better way of doing this.

So, is it possible to "tell" Perl that I can tolerate 1 mismatch in the sequence when using the =~ m// operator? Or is there another function I could use?

Thanks a lot for your help!

  if ($elements[0] =~ m/CCCAGGT/){
    $mutg = $mutg + $elements[1];
  }
  elsif ($elements[0] =~ m/CCCAGTT/){
    $mutt = $mutt + $elements[1];
  }
  elsif ($elements[0] =~ m/CCCAGAT/){
    $muta = $muta + $elements[1];
  }
  elsif ($elements[0] =~ m/CCCAGCT/){
    $mutc = $mutc + $elements[1];
  }
  else {
    if ($elements[0] =~ m/.CCAGGT/){
      $mutg = $mutg + $elements[1];
    }
    elsif ($elements[0] =~ m/.CCAGTT/){
      $mutt = $mutt + $elements[1];
    }
    elsif ($elements[0] =~ m/.CCAGAT/){
      $muta = $muta + $elements[1];
    }
    elsif ($elements[0] =~ m/.CCAGCT/){
      $mutc = $mutc + $elements[1];
    }
    else {
      # ...and so on, with the "." in the second position, then the third, etc.
    }
  }

ikegami · Accepted Answer

To simply check whether there's a match with at most one substituted letter (as opposed to finding the nearest match), we can build the patterns (.CCAGGT, C.CAGGT, etc.) dynamically.

my $target_seq = "CCCAGGT";

my @parts = map quotemeta, split //, $target_seq;
my $fuzzy_pat =
   join "|",
      map { join("", @parts[0..$_-1], ".", @parts[$_+1..$#parts]) }
         0..$#parts;

my $fuzzy_re = qr/$fuzzy_pat/;
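
As a quick check, the same construction can be run against the two example lines from the question. This is only a minimal sketch: the helper name `fuzzy_re_for` and the third "no match" line are assumptions added for illustration, not part of the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches $seq with any single
# position replaced by "." (same construction as the snippet above).
sub fuzzy_re_for {
    my ($seq) = @_;
    my @parts = map quotemeta, split //, $seq;
    my $pat = join "|",
        map { join("", @parts[0..$_-1], ".", @parts[$_+1..$#parts]) }
            0..$#parts;
    return qr/$pat/;
}

my $re = fuzzy_re_for("CCCAGGT");

# The two example lines from the question, plus a made-up line
# that contains no near match.
my @lines = qw(
    AGGCTCAGGAGCCCATGTGGGCGGACCCAT
    CGGCTCAGGAGCCCGGGTGGGCGGTCCCAT
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
);

# 1 if the line contains the target with at most one substitution, else 0.
my @hits = map { $_ =~ $re ? 1 : 0 } @lines;
print "@hits\n";   # prints "1 1 0"
```

The first line matches through the alternative CCCA.GT and the second through CCC.GGT. An exact occurrence of the target also matches, since "." accepts the original letter too.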

This can be extended to check for multiple sequences at once, as long as one doesn't care which sequence is found.

use List::Util qw( uniq );

my @target_seqs = qw( CCCAGGT CCCAGTT CCCAGAT CCCAGCT );

my @fuzzy_pats;
for my $pat (@target_seqs) {
   my @parts = map quotemeta, split //, $pat;
   for my $i (0..$#parts) {
      push @fuzzy_pats, join("", @parts[0..$i-1], ".", @parts[$i+1..$#parts]);
   }
}

# The explicit block stops sort from parsing "uniq" as its comparison sub.
my $fuzzy_pat = join "|", sort { $a cmp $b } uniq @fuzzy_pats;
my $fuzzy_re = qr/$fuzzy_pat/;

$mtg += $elements[1] if $elements[0] =~ $fuzzy_re;
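
The question keeps a separate counter per target, which a single combined pattern cannot do. One possible variation, sketched here, keeps one fuzzy regex per target and tries exact matches first, much like the if/elsif chain in the question. The names `%fuzzy_re`, `%count`, `classify` and the sample records are assumptions for illustration. Note that the four targets differ only at the sixth letter, so a mismatch at that position is within one substitution of all of them; this sketch simply takes the first target in list order in that case.

```perl
use strict;
use warnings;

my @target_seqs = qw( CCCAGGT CCCAGTT CCCAGAT CCCAGCT );

# One fuzzy regex per target, so we know which target was hit.
my %fuzzy_re;
for my $seq (@target_seqs) {
    my @parts = map quotemeta, split //, $seq;
    my $pat = join "|",
        map { join("", @parts[0..$_-1], ".", @parts[$_+1..$#parts]) }
            0..$#parts;
    $fuzzy_re{$seq} = qr/$pat/;
}

# Return the target a read should be counted under, or nothing.
# Exact matches win; otherwise the first target within one mismatch.
sub classify {
    my ($read) = @_;
    for my $seq (@target_seqs) {
        return $seq if index($read, $seq) >= 0;
    }
    for my $seq (@target_seqs) {
        return $seq if $read =~ $fuzzy_re{$seq};
    }
    return;
}

# Hypothetical records: [ sequence, weight ], mirroring @elements.
my @records = (
    [ "AGGCTCAGGAGCCCATGTGGGCGGACCCAT", 2 ],
    [ "CGGCTCAGGAGCCCGGGTGGGCGGTCCCAT", 3 ],
    [ "TTTTCCCAGTTTTTT",                5 ],
);

my %count = map { $_ => 0 } @target_seqs;
for my $rec (@records) {
    my $seq = classify($rec->[0]);
    $count{$seq} += $rec->[1] if defined $seq;
}

print "$_ => $count{$_}\n" for @target_seqs;
```

With these sample records, the first two reads are counted under CCCAGGT (weights 2 and 3) and the third is an exact CCCAGTT hit (weight 5).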
