nferr

Reputation: 83

Regular Expressions Matching on multiple separated characters

In this string:

"<0> <<1>> <2>> <3> <4>"

I want to match all instances of "<\d{1,2}>" except those I have escaped with an extra pair of angle brackets. That is, I want to match 0, 2, 3 and 4 but not 1:

"<0> <<1>> <2>> <3> <4>"

I want to do this in one single regular expression but the best I could get is:

(^|[^\<])\<(?<1>\d{1,2})>([^>]|$)

This matches 0, 3 and 4 but not 2:

"<0> <<1>> <2>> <3> <4>"

Does anyone know how this can be done with a single regular expression?

Upvotes: 4

Views: 319

Answers (6)

Brad Gilbert

Reputation: 34120

Here is a quick and easy way to do this with Perl.

use strict;
use warnings;

my $str = "<0> <<1>> <2>> <3> <4>";
my @array = grep {defined $_} $str =~ /<<\d+>>|<(\d+)>/g;

print join( ', ', @array ), "\n";
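The pattern relies on alternation order: the escaped form <<\d+>> is tried first and swallows the whole escaped tag without capturing anything, so only unescaped tags fill the capture group. A rough Python sketch of the same idea, assuming Python's re module (the name extract_unescaped is made up for illustration):

```python
import re

# Same idea as the Perl pattern: the first alternative swallows an
# escaped tag without capturing; the second captures an ordinary tag.
PATTERN = re.compile(r"<<\d+>>|<(\d+)>")


def extract_unescaped(text):
    """Return the numbers of tags not wrapped in a second bracket pair."""
    # findall yields group 1 for every match; it is "" whenever the
    # escaped alternative matched, so those entries are dropped.
    return [num for num in PATTERN.findall(text) if num]


if __name__ == "__main__":
    print(", ".join(extract_unescaped("<0> <<1>> <2>> <3> <4>")))  # 0, 2, 3, 4
```

Filtering the empty strings plays the role of grep {defined $_} in the Perl version.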

Upvotes: 0

Alan Moore

Reputation: 75222

In case you're using a regex flavor (like Java's) that supports lookarounds but not conditionals, here's another approach:

(?=(<\d{1,2}>))(?!(?<=<)\1(?=>))\1

The first lookahead ensures that you're at the beginning of a tag and captures it for later use. The subexpression in the second lookahead matches the tag again, but only if it's preceded by a < and followed by a >. Making it a negative lookahead achieves the NOT(x AND y) semantics you're looking for. Finally, the second \1 matches the tag again, this time for real (i.e., not in a lookaround).

BTW, I could have used > instead of (?=>) in the second lookahead, but I think this way is easier to read and expresses my intent better.
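To see the lookaround version in action, here is a small sketch using Python's re module, which also supports lookarounds and backreferences. The helper name unescaped_tags is hypothetical, and the second input (with a trailing <<5>) is an extra case added for illustration:

```python
import re

# Lookahead captures the tag, the negative lookahead rejects it only when
# it is BOTH preceded by "<" and followed by ">", then \1 consumes it.
TAG = re.compile(r"(?=(<\d{1,2}>))(?!(?<=<)\1(?=>))\1")


def unescaped_tags(text):
    """Return every tag that is not fully wrapped in extra brackets."""
    return [m.group(1) for m in TAG.finditer(text)]


if __name__ == "__main__":
    print(unescaped_tags("<0> <<1>> <2>> <3> <4>"))
    # ['<0>', '<2>', '<3>', '<4>']
    print(unescaped_tags("<0> <<1>> <2>> <3> <4><<5>"))
    # ['<0>', '<2>', '<3>', '<4>', '<5>']
```

Because the rejection requires both conditions, half-escaped tags such as <2>> and <<5> still match.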

Upvotes: 0

Beano

Reputation: 7831

Presuming that with the input set

 "<0> <<1>> <2>> <3> <4><<5>"

we want to match 0, 2, 3, 4 and 5.

The problem is that you need zero-width look-ahead and zero-width look-behind, but there are three cases to match (a tag with an extra '<' before it, a tag with an extra '>' after it, and a tag with neither) and one case not to match (a tag with both, '<' before and '>' after). Also, if you want to extract the matches so that you can assign them to an array, you need to avoid capturing things you don't need. So I ended up with the inelegant

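use strict;
use warnings;
use Data::Dumper;

my $str = "<0> <<1>> <2>> <3> <4><<5>";

# A tag: "<", one or more non-bracket characters, ">"
my $brace_pair = qr/<[^<>]+>/;
my @matches = $str =~ /(?:(?<!<)$brace_pair(?!>))|(?:$brace_pair(?!>))|(?:(?<!<)$brace_pair)/g;

# Show the matched tags
print Dumper(\@matches);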

If you wanted to cram this into a single expression, you could.
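One observation, offered as a sketch rather than as part of the answer above: the first alternative (no '<' before and no '>' after) is already covered by either of the other two, so the three cases can be collapsed into two. The Python below, using the re module and made-up constant names, checks that both forms give the same result on the sample input:

```python
import re

# A tag: "<", one or more non-bracket characters, ">"
BRACE_PAIR = r"<[^<>]+>"

# Direct port of the three-alternative pattern.
THREE_CASES = re.compile(
    rf"(?:(?<!<){BRACE_PAIR}(?!>))|(?:{BRACE_PAIR}(?!>))|(?:(?<!<){BRACE_PAIR})"
)

# Collapsed form: accept a tag if it lacks a trailing ">" OR lacks a leading "<".
TWO_CASES = re.compile(rf"{BRACE_PAIR}(?!>)|(?<!<){BRACE_PAIR}")

text = "<0> <<1>> <2>> <3> <4><<5>"
three = THREE_CASES.findall(text)
two = TWO_CASES.findall(text)

print(three)         # ['<0>', '<2>', '<3>', '<4>', '<5>']
print(two == three)  # True
```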

Upvotes: 1

Schwern

Reputation: 164639

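Here's an alternative to a single regex. Split the string into a list at each >< boundary (allowing whitespace between the brackets), then exclude any piece still wrapped in <...>, since only the escaped tags keep both of their brackets.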

#!/usr/bin/perl -lw

$s = "<0> <<1>> <2>> <3> <4>";

print join " ",
      map { /(\d+)/; $1 }
      grep !/^<.*>$/,
      split />\s*</, $s;
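The same split-then-filter steps can be sketched in Python for comparison. This assumes Python's re module, and the function name unescaped_numbers is invented for the example:

```python
import re


def unescaped_numbers(text):
    """Split at '> <' boundaries, drop fully bracketed pieces, keep the digits."""
    # "<0> <<1>> <2>> <3> <4>" splits into: "<0", "<1>", "2>", "3", "4>"
    pieces = re.split(r">\s*<", text)

    # Only an escaped tag still has both of its brackets after the split.
    kept = [p for p in pieces if not re.fullmatch(r"<.*>", p)]

    numbers = []
    for piece in kept:
        found = re.search(r"\d+", piece)
        if found:  # skip pieces that contain no digits at all
            numbers.append(found.group())
    return numbers


if __name__ == "__main__":
    print(" ".join(unescaped_numbers("<0> <<1>> <2>> <3> <4>")))  # 0 2 3 4
```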

Upvotes: 0

Bojan Resnik

Reputation: 7378

You can also try conditionals: (?(?<=<)(<\d{1,2}>(?!>))|(<\d{1,2}>))
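The conditional reads as: if the tag is preceded by '<', it must not be followed by '>'; otherwise match it as it is. Python's re module only accepts a group number or name as a condition, so the pattern cannot be used there unchanged. The sketch below rewrites the same logic as a plain alternation (an equivalent form chosen for illustration, not the conditional syntax itself):

```python
import re

# Branch 1: preceded by "<", so the tag must NOT be followed by ">".
# Branch 2: not preceded by "<", so the tag always matches.
CONDITIONAL_AS_ALTERNATION = re.compile(
    r"(?<=<)<\d{1,2}>(?!>)"
    r"|(?<!<)<\d{1,2}>"
)

if __name__ == "__main__":
    print(CONDITIONAL_AS_ALTERNATION.findall("<0> <<1>> <2>> <3> <4>"))
    # ['<0>', '<2>', '<3>', '<4>']
```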

Upvotes: 5

Konrad Rudolph

Reputation: 545508

You can use a negative look-behind zero-width assertion:

(?<!<)<\d{1,2}>
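A quick way to try this out, sketched with Python's re module; a capture group around the digits is added here so that only the numbers come back, and the function name is made up. Note that this pattern looks only at the character before the tag, so a tag with just a leading extra '<' (such as <<5>) is skipped too. Whether that matters depends on the inputs you expect.

```python
import re

# Reject any tag that is immediately preceded by another "<".
LOOKBEHIND = re.compile(r"(?<!<)<(\d{1,2})>")


def lookbehind_numbers(text):
    """Return the numbers of tags not preceded by an extra '<'."""
    return LOOKBEHIND.findall(text)


if __name__ == "__main__":
    print(lookbehind_numbers("<0> <<1>> <2>> <3> <4>"))      # ['0', '2', '3', '4']
    print(lookbehind_numbers("<0> <<1>> <2>> <3> <4><<5>"))  # ['0', '2', '3', '4']
```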

Upvotes: 2
