dstorey

Reputation: 104

Regular expression capture and backreference

Here's the string I'm searching.

T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG

I want to capture the X letters that follow each number, where X is that number. I also want to capture the complete match, including the + and the number.

i.e. the captures should return:

+4ACCG
+12CAAGTACTACCG
etc.

and:

ACCG
CAAGTACTACCG
etc.

Here's the regex I'm using:

(\+(\d+)([ATGCatgcnN]){\2});

and I'm using $1 and $3 for the captures.

What am I missing?

Upvotes: 2

Views: 156

Answers (3)

Ersun Warncke

Reputation: 51

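# Split on '+'; every piece after the first starts with the length digits.
my @sequences = split(/\+/, $string);

for my $seq (@sequences) {
    # Capture the first run of non-digit characters (the bases).
    my ($bases) = $seq =~ /(\D+)/;
}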
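The split approach above captures every letter up to the next +, which in the sample string is one letter more than the announced count. A minimal sketch of how it could be made to honor the number, assuming each piece after a + starts with its count; the helper name bases_by_split is hypothetical and not part of the original answer:

```perl
use strict;
use warnings;

# Hypothetical helper: split on '+', then keep only the first N bases
# of each piece, where N is the number that leads the piece.
sub bases_by_split {
    my ($string) = @_;
    my @found;
    for my $seq (split /\+/, $string) {
        # Pieces without a leading number (such as the text before the
        # first '+') do not match and are skipped.
        my ($n, $letters) = $seq =~ /^(\d+)([ATGCN]+)/i or next;
        next if length($letters) < $n;    # fewer bases than announced
        push @found, substr($letters, 0, $n);
    }
    return @found;
}

print "$_\n" for bases_by_split('T+4ACCGT+12CAAGTACTACCGT+4ACCGA');
```

On this shortened sample it should print ACCG, CAAGTACTACCG and ACCG, one per line.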

Upvotes: 0

Chris Charley

Reputation: 6573

This loop works because the \G assertion tells the regex engine to start matching immediately after the previous match (the digits) in the string.

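use feature 'say';    # needed for say

$_ = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';

# Find each number, then match exactly that many bases right after it.
while (/(\d+)/g) {
    my $dig = $1;
    # \G anchors at pos(), i.e. just past the digits matched above.
    # Print only on success so a stale $1 is never shown.
    say $1 if /\G([TAGCN]{$dig})/i;
}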

The results are

ACCG
CAAGTACTACCG
CAAGTACTACCG
ACCG
CTACCG
CAAGTACTACCG
CAAGTACTACCG

I think this is correct but not sure :-|

Update: Added the \G assertion which tells the regex to begin immediately after the last matched number.
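The loop above prints only the bases, while the question also asked for the full +4ACCG form. A sketch along the same lines that collects both, assuming the count always follows a literal +; the helper name insertions is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: returns [full, bases] pairs such as ['+4ACCG', 'ACCG'].
sub insertions {
    my ($string) = @_;
    my @pairs;
    while ($string =~ /\+(\d+)/g) {
        my $n = $1;
        # \G anchors at pos($string), just past the digits. The /c flag
        # keeps that position if the bases do not match, so the outer
        # loop can carry on from there.
        if ($string =~ /\G([ATGCN]{$n})/gci) {
            push @pairs, ["+$n$1", $1];
        }
    }
    return @pairs;
}

for my $pair (insertions('T+4ACCGT+12CAAGTACTACCGT+6CTACCG')) {
    print "$pair->[0]\t$pair->[1]\n";
}
```

The count is copied into $n before the second match because that match overwrites $1.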

Upvotes: 1

stema

Reputation: 92986

You cannot use a backreference in a quantifier. \1 is an instruction to match what $1 contains, so {\1} is not a valid quantifier. But why do you need to match the exact number? Just match the letters (because the next part starts again with a +).

So try:

(\+\d+([ATGCatgcnN]+));

and find the complete match in $1 and the letters in $2.

Another problem in your regex is that your quantifier is outside your third capturing group. That way only the last letter would be in the capturing group. Place the quantifier inside the group to capture the whole sequence.
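The difference between the two placements can be seen on a four-letter string. This is only an illustrative sketch; the variable names are made up for the example:

```perl
use strict;
use warnings;

my $text = 'ACCG';

# Quantifier outside the group: the group is re-captured on every
# repetition, so only the last repetition (one letter) survives.
my ($outside) = $text =~ /([ATGC]){4}/;

# Quantifier inside the group: the group spans all four letters.
my ($inside) = $text =~ /([ATGC]{4})/;

print "outside: $outside\n";    # outside: G
print "inside:  $inside\n";     # inside:  ACCG
```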

You can also drop the duplicated upper- and lower-case letters from your class by using the i modifier to match case-insensitively:

/(\+\d+([ATGCN]+))/gi
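A short sketch of how this pattern might be applied in a loop; the helper name greedy_matches is hypothetical. Note that [ATGCN]+ takes every letter up to the next +, so on the question's sample each capture comes out one letter longer than the number announces. If the exact length matters, the letters would still have to be trimmed to the count afterwards, for instance with substr.

```perl
use strict;
use warnings;

# Hypothetical helper: collects [full, letters] pairs using the pattern
# from this answer.
sub greedy_matches {
    my ($string) = @_;
    my @pairs;
    while ($string =~ /(\+\d+([ATGCN]+))/gi) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

print "$_->[0]\t$_->[1]\n" for greedy_matches('T+4ACCGT+12CAAGTACTACCGT+4ACCGA');
```

On this shortened sample the first pair should be +4ACCGT and ACCGT, i.e. five letters after a count of 4.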

Upvotes: 3
