berutti

Reputation: 23

Perl regex: match a number followed by as many characters as that number specifies within a string

I (think I) am quite experienced in Perl, but I have a nasty problem I'm trying to solve. I have to match a string in the following format, which I cannot change because it comes out of a bioinformatics program:

[\+\-][0-9]+[ACGTacgt]+

This would be easy, except that the number of repeats of [ACGTacgt] is not simply one or more but exactly the number given by [0-9]+, so it can be

[...whatever...]+2ac[...whatever...]
+4acta
+3atg

etc..

Now, to test whether the regex works, I'm just playing with a substitution, and I tried the following:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{\1}//g

Unfortunately, this does not work, and I get an error complaining about unescaped braces. If I put a literal number in place of \1, it works:

$mystring =~ s/[\+\-]([0-9]+)[ACGTacgt]{1}//g

I need this to work because the input might contain sequences like ac.,.+2caaa..a.c, from which I have to extract exactly +2ca, separately from the rest.

Is it possible in one step, or is there a logical reason I'm missing for why it's not possible?

Thanks for any help or suggestions!

berutti

Upvotes: 2

Views: 556

Answers (2)

zdim

Reputation: 66883

You can iterate over the numbers and, in the loop body, match the captured number of letters that follow:

use warnings;
use strict;
use feature 'say';

my $s = q(ac.,.+2caaa..a.c-3acgg+1tt);

while ($s =~ /[+-]([0-9]+)/g) { 
    my $c = $1; 
    $s =~ /\G([acgt]{$c})/i or next;

    say "$c$1";  # or process it further / store it ...
}

The \G assertion makes its regex start matching from where the previous m//g match ended, as needed. This is a standard approach to "chaining" global matches and, more generally, to scanning text by coordinating multiple regexes. See the docs for it under Assertions in perlre and, in far more detail, in perlop (search for \G).

Prints

2ca
3acg
1t

If the [+-] needs to be extracted as well, add capturing parentheses around it and renumber the captures (the sign will then be in $1 and the number in $2).
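To see the \G hand-off in isolation, here is a minimal sketch that handles only the first signed number in a string. The helper name letters_after_count is made up for illustration and is not part of the code above.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: return the letters that follow the first signed
# number in $s, taking exactly as many letters as that number says.
sub letters_after_count {
    my ($s) = @_;

    # Scalar-context m//g: on success, pos($s) is left just past the digits.
    $s =~ /[+-]([0-9]+)/g or return;
    my $count = $1;

    # No /g here: the match is tried once, anchored by \G at pos($s).
    $s =~ /\G([acgt]{$count})/i or return;
    return $1;
}

say letters_after_count('ac.,.+2caaa..a.c');    # prints "ca"
```

If fewer letters follow than the number asks for, the second match fails and the helper returns nothing, which corresponds to the "or next" branch in the loop.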

Please clarify other requirements -- for instance: Do you only need to extract the patterns or should anything in particular happen with the original string as well?


Update: It has been clarified that the matches also need to be removed from the string.

An easy way is to simply remove them with another regex, after they have been collected.

After the same processing as above, the collected matches are used to form a pattern with alternation for their removal. This is also efficient since, by construction, the subpatterns in the alternation come in the order of their appearance in the string.

use warnings;
use strict;
use feature 'say';

my $string = q(ac.,.+2caaa..a.c-3acgg+1tt);

my @matches;

while ($string =~ /([+-])([0-9]+)/g) { 
    my ($sign, $count)  = ($1, $2);
    $string =~ /\G([acgt]{$count})/i or next;    
    push @matches, $sign.$count.$1; 
}    
say for @matches;

my $matches_re = '(?:' . join('|', map { quotemeta } @matches) . ')';

$string =~ s/$matches_re//g;    
say $string;

where I've now joined the sign [+-] to the match.

It prints

+2ca
-3acg
+1t
ac.,.aa..a.cgt
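As a variation on the two-step approach above, the collecting and the removal can probably be folded into one pass by recording offsets and copying the text between the pieces. This is only a sketch under that assumption: the helper name split_indels and its return convention are made up here, and it relies on the documented @- and @+ offset arrays.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: returns the cleaned string followed by the removed pieces.
sub split_indels {
    my ($string) = @_;
    my @matches;
    my $cleaned = '';
    my $last    = 0;    # offset just past the previously removed piece

    while ($string =~ /([+-])([0-9]+)/g) {
        # Save the start offset now; a successful inner match overwrites @-.
        my ($sign, $count, $start) = ($1, $2, $-[0]);
        $string =~ /\G([acgt]{$count})/i or next;

        push @matches, $sign . $count . $1;
        $cleaned .= substr($string, $last, $start - $last);    # text before this piece
        $last = $+[0];                                         # end offset of its letters
    }
    $cleaned .= substr($string, $last);    # whatever follows the last piece

    return ($cleaned, @matches);
}

my ($cleaned, @pieces) = split_indels('ac.,.+2caaa..a.c-3acgg+1tt');
say for @pieces;
say $cleaned;
```

For the sample string this should print the same four lines as the version above. The difference is that each piece is cut out at the position where it was found, instead of being searched for again afterwards.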

Upvotes: 1

Grinnz

Reputation: 9231

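The {N} component of the regex is a quantifier, and its count must be a literal number known when the pattern is compiled, so it can't take a backreference such as \1. You could work around this with a postponed regular subexpression, (??{ ... }), whose embedded Perl code builds the rest of the pattern at match time: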

use strict;
use warnings;
my $string = 'ac.,.+2caaa..a.c';
$string =~ s/[+-]([0-9]+)(??{ "[ACGTacgt]{$1}" })//g;
print "$string\n";

Note that embedded subexpressions are a last resort and, because their content is only known at match time, prevent the regex from being optimized properly. It is IMO an appropriate tradeoff for this exact case, where the matched substring must be removed, but if your requirements are slightly different, a split-out iterative approach may be more appropriate.
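If the removed pieces are also needed, the same (??{ ... }) construct can plausibly be combined with the /e modifier to collect them during the substitution. This is a sketch rather than a tested recommendation: the helper name strip_indels is made up, and the number moves to $2 because an outer group now captures the whole piece.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned string followed by the removed pieces.
sub strip_indels {
    my ($string) = @_;
    my @removed;

    # Group 1 is the whole piece; group 2 is the count, which is already
    # closed (and so readable as $2) when the (??{ }) code runs.
    # With /e the replacement is Perl code: record the piece, substitute ''.
    $string =~ s/([+-]([0-9]+)(??{ "[ACGTacgt]{$2}" }))/push @removed, $1; ''/ge;

    return ($string, @removed);
}

my ($cleaned, @removed) = strip_indels('ac.,.+2caaa..a.c-3acgg+1tt');
print "$_\n" for @removed;
print "$cleaned\n";
```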

Upvotes: 3
