Perl match with regex a number and as many following characters as the number specifies within a string

Question

I (think I) am quite experienced in Perl, but I still have a nasty problem I'm trying to solve. I have to match a string (produced by bioinformatics software, so I cannot change its format) in this format:

[\+\-][0-9]+[ACGTacgt]+

This would be easy, except that the number of repeats of the pattern [ACGTacgt] is not really "one or more" but exactly the number given by [0-9]+, so it can be

[...whatever...]+2ac[...whatever...]
+4acta
+3atg

etc..

Now, to test whether the regex works, I'm just playing with a substitution, and I tried the following:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{\1}//g

Unfortunately this guy above does not work and I get an error complaining about unescaped braces. Indeed if I define a proper number instead of \1 the thing works:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{1}//g

I need this to work because the format may contain sequences like ac.,.+2caaa..a.c, from which I have to extract exactly the +2ca and keep it separate from the rest.

Is it possible in one step, or is there a logical reason I'm missing for why it's not possible?

Thanks for any help or suggestions!

berutti

Grinnz · Accepted Answer

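The {N} component of a regex is a quantifier, and its count is fixed when the pattern is compiled, so it can't take a backreference such as \1, whose value is only known during matching. You could work around it with a postponed regular subexpression, (??{ ... }), which embeds a Perl expression whose result is then matched as a pattern: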

use strict;
use warnings;
my $string = 'ac.,.+2caaa..a.c';
$string =~ s/[+-]([0-9]+)(??{ "[ACGTacgt]{$1}" })//g;
print "$string\n";

Note that embedded subexpressions are a last resort: because the subpattern is built and compiled at match time, the regex cannot be optimized properly. It is IMO an appropriate tradeoff for this exact case, where the matched substring must be removed, but if your requirements are slightly different, a split-out iterative approach may be more appropriate.
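For illustration, here is a minimal sketch of what such a split-out iterative approach could look like. It is not part of the original answer: the helper name split_indels and its return convention are hypothetical. It assumes you want both the cleaned string and the list of removed +N/-N sequences. A plain regex reads the sign and the length, and substr then takes exactly that many characters.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the answer above): returns the
# string with all indels removed, followed by the list of indels found.
sub split_indels {
    my ($string) = @_;          # private copy, so pos() changes stay local
    my $rest = '';
    my @indels;
    pos($string) = 0;

    # \G continues where the previous iteration stopped; /c keeps pos()
    # intact when the final match attempt fails.
    while ($string =~ /\G(.*?)([+-])([0-9]+)/gcs) {
        my ($before, $sign, $len) = ($1, $2, $3);
        my $bases = substr($string, pos($string), $len);

        if (length($bases) == $len && $bases =~ /\A[ACGTacgt]*\z/) {
            # Well-formed indel: keep the text before it, record the indel,
            # and skip over the bases it covers.
            $rest .= $before;
            push @indels, "$sign$len$bases";
            pos($string) = pos($string) + $len;
        }
        else {
            # Not followed by enough valid bases: treat it as ordinary text.
            $rest .= "$before$sign$len";
        }
    }

    # Append whatever follows the last sign/number pair.
    $rest .= substr($string, pos($string));
    return ($rest, @indels);
}

my ($rest, @indels) = split_indels('ac.,.+2caaa..a.c');
print "$rest\n";      # ac.,.aa..a.c
print "@indels\n";    # +2ca
```

This avoids (??{ ... }) entirely, at the cost of more code. The upside is that each indel is available as a separate value, which the one-step substitution does not give you directly.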
