Reputation: 23
I (think I) am quite experienced in Perl, but I have a nasty problem I'm trying to solve. I have to match a string (produced by bioinformatics software, so I cannot change its format) of this form:
[\+\-][0-9]+[ACGTacgt]+
This would be easy, except that the number of repeats of [ACGTacgt] is not simply one or more but exactly the number given by [0-9]+, so it can be:
[...whatever...]+2ac[...whatever...]
+4acta
+3atg
etc..
Now, to test whether the regex works, I'm just playing with a substitution, and I tried the following:
$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{\1}//g
Unfortunately this does not work, and I get an error complaining about unescaped braces. If I put a literal number in place of \1, it works:
$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{1}//g
I need it to work since the input might contain sequences like ac.,.+2caaa..a.c, from which I have to extract exactly the +2ca, separately from the rest.
Is it possible in one step, or is there a logical reason I'm missing for why it's not possible?
Thanks for any help or suggestions!
berutti
Upvotes: 2
Views: 556
Reputation: 66883
You can iterate over the numbers and, in the loop body, match the captured number of letters that follow:
use warnings;
use strict;
use feature 'say';
my $s = q(ac.,.+2caaa..a.c-3acgg+1tt);
while ($s =~ /[+-]([0-9]+)/g) {
    my $c = $1;
    $s =~ /\G([acgt]{$c})/i or next;
    say "$c$1"; # or process it further / store it ...
}
The \G assertion makes its regex start from where the previous m//g match ended, as needed. This is a standard approach to "chain global matches" and generally scan text by coordinating multiple regexes. See the docs for it in Assertions in perlre and, for far more detail, in perlop (search for \G).
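To make the role of \G and pos() more concrete, here is a minimal sketch of a single step of that loop. The variable names ($text, $after_count, $letters) are made up for illustration and the input is a shortened version of the sample string:

```perl
use strict;
use warnings;

my $text = 'ac.,.+2caaa';    # shortened sample input (an assumption)

# A successful scalar-context m//g match records where it ended in pos($text).
$text =~ /[+-]([0-9]+)/g or die "no count found";
my $count       = $1;
my $after_count = pos($text);    # offset just past the digits

# \G anchors this second match at pos($text), so the letters must start
# right after the number; {$count} is interpolated before the match runs.
my ($letters) = $text =~ /\G([acgt]{$count})/i;

print "count=$count pos=$after_count letters=$letters\n";
```

Since the second match has no /g, it is attempted only once, at the position where the first match stopped.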
Prints
2ca
3acg
1t
If the [+-] needs to be extracted as well, add capturing parens around it and renumber the captures (the sign will then be in $1 and the number in $2).
Please clarify other requirements -- for instance: Do you only need to extract the patterns or should anything in particular happen with the original string as well?
Update: It has been clarified that the matches also need to be removed from the string.
An easy way is to simply remove them with another regex, after they have been collected.
After the same processing as above, the collected matches are used to form a pattern with alternation for their removal. This is also efficient since, by construction, the subpatterns in the alternation come in the order of their appearance in the string:
use warnings;
use strict;
use feature 'say';
my $string = q(ac.,.+2caaa..a.c-3acgg+1tt);
my @matches;
while ($string =~ /([+-])([0-9]+)/g) {
    my ($sign, $count) = ($1, $2);
    $string =~ /\G([acgt]{$count})/i or next;
    push @matches, $sign.$count.$1;
}
say for @matches;
my $matches_re = '(?:' . join('|', map { quotemeta } @matches) . ')';
$string =~ s/$matches_re//g;
say $string;
where I've now joined the sign [+-] to the match.
It prints
+2ca
-3acg
+1t
ac.,.aa..a.cgt
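As a side note on the quotemeta in the removal pattern: the collected pieces start with + or -, and an unescaped leading + is a quantifier with nothing to repeat. A small sketch of the difference, with illustrative variable names ($raw, $raw_ok, $pattern) and a hard-coded list standing in for the collected matches:

```perl
use strict;
use warnings;

my @matches = ('+2ca', '-3acg');    # stand-ins for the collected pieces

# Compiling the raw text as a pattern dies ("Quantifier follows nothing"),
# so the attempt is wrapped in eval to observe the failure.
my $raw    = $matches[0];
my $raw_ok = eval { qr/$raw/; 1 } ? 1 : 0;

# quotemeta backslash-escapes every non-word character, so each piece
# is matched literally inside the alternation.
my $pattern = join '|', map { quotemeta } @matches;

print "raw compiles: $raw_ok\n";
print "escaped: $pattern\n";
```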
Upvotes: 1
Reputation: 9231
The {N} component of the regex is a quantifier, and its count can't be a backreference. You could work around it with an embedded Perl expression (a postponed subexpression, (??{ ... })):
use strict;
use warnings;
my $string = 'ac.,.+2caaa..a.c';
$string =~ s/[+-]([0-9]+)(??{ "[ACGTacgt]{$1}" })//g;
print "$string\n";
Note that embedded subexpressions are a last resort and, for obvious reasons, prevent the regex from being optimized properly. It is IMO an appropriate tradeoff for this exact case, where the matched substring must be removed, but if your requirements are slightly different, a split-out iterative approach may be more appropriate.
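If the removed pieces also have to be kept, one possible variation (a sketch, not something from the question) is to wrap the whole hit in an extra capture group and collect it in an /e replacement. The @removed array and the longer sample string are assumptions for illustration; note that the count moves to $2 because of the added outer group:

```perl
use strict;
use warnings;

my $string = 'ac.,.+2caaa..a.c-3acgg+1tt';    # sample input (an assumption)
my @removed;

# Group 1 wraps the whole hit; group 2 is the count, which the postponed
# subexpression turns into an exact-length character class at match time.
# With /e the replacement is code: it stores the hit and returns ''.
$string =~ s/([+-]([0-9]+)(??{ "[ACGTacgt]{$2}" }))/push @removed, $1; ''/ge;

print "$_\n" for @removed;
print "$string\n";
```

This keeps everything in one substitution, at the cost of the same optimization caveat that applies to any (??{ ... }) pattern.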
Upvotes: 3