rinogo

Reputation: 9163

Multiple matches within a regex group?

I need to match all 'tags' (e.g. %thisIsATag%) that occur within XML attributes. (Note: I'm guaranteed to receive valid XML, so there is no need for full DOM traversal.) My regex works, except that when there are two tags in a single attribute, only the last one is returned.

In other words, this regex should find tag1, tag2, ..., tag6. However, it omits tag2 and tag5.

Here's a fun little test harness for you (PHP):

<?php

$xml = <<<XML
<data>
 <slideshow width="625" height="250">

  <screen delay="%tag1%">
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize="32" fontstyle="bold" text="Screen One!%tag2% %tag3%"/>
   </text>
  </screen>

  <screen delay='%tag4%'>
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize='32' fontstyle='bold' text='Screen 2!%tag5%%tag6%'/>
   </text>
  </screen>

  <screen>
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize="32" fontstyle="bold"  text="Screen Tres!"/>
   </text>
  </screen>

  <screen>
   <text x="30%" y="50%" animatefromx="800">
    <line fontsize="32" fontstyle="bold"  text="Screen FOURRRR!"/>
   </text>
  </screen>

 </slideshow>
</data>
XML;

$matches = null;
preg_match_all('#<[^>]+("([^%>"]*%([^%>"]+)%[^%>"]*)+"|\'([^%>\']*%([^%>\']+)%[^%>\']*)+\')[^>]*>#i', $xml, $matches);

print_r($matches);
?>

Thanks! :)

Upvotes: 1

Views: 2358

Answers (3)

Alan Moore

Reputation: 75232

What you're trying to do is recover intermediate captures from groups that match more than once per regex match. As far as I know, only .NET and Perl 6 provide that capability. You'll have to do the job in two stages: match an attribute value with one or more %tag% sequences in it, then break out the individual sequences.

You don't seem to care which XML tag or attribute the values are associated with, so you could use this somewhat simpler regex to find the values with %tag% sequences in them:

'#"([^"%<>]*+%[^%"]++%[^"]*+)"|\'([^\'%<>]*+%[^%\']++%[^\']*+)\'#'

EDIT: That regex captures the attribute value in group 1 or group 2, depending on which quotes were used. Here's another version that merges the alternatives so it can always save the value in group 2:

'#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#'
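For illustration, the two stages could be wired together roughly like this. It is only a sketch: the XML is a shortened, hypothetical version of the sample in the question, and the stage-2 pattern `%(\w+)%` assumes tag names consist only of word characters, which the question does not actually state.

```php
<?php
// Shortened, hypothetical version of the XML from the question.
$xml = <<<XML
<screen delay="%tag1%">
 <text x="30%" y="50%">
  <line text="Screen One!%tag2% %tag3%"/>
  <line text='Screen 2!%tag5%%tag6%'/>
 </text>
</screen>
XML;

// Stage 1: find quoted values that contain at least one %tag% sequence.
// Group 1 is the quote character, group 2 is the value itself.
$valuePattern = '#(["\'])((?:(?![%<>]|\1).)*+%(?:(?!%|\1).)++%(?:(?!\1).)*+)\1#';
preg_match_all($valuePattern, $xml, $values);

// Stage 2: break each value into its individual tags.
// Assumption: tag names are made of word characters only.
$tags = [];
foreach ($values[2] as $value) {
    preg_match_all('/%(\w+)%/', $value, $m);
    $tags = array_merge($tags, $m[1]);
}

print_r($tags);
```

The plain percentages in `x="30%"` and `y="50%"` are skipped in stage 1, because the pattern requires a second `%` inside the same quoted value.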

Upvotes: 2

Mentee

Reputation: 51

%\w+% would be an even simpler way of doing this.
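A quick way to try this is with `preg_match_all` on a single line. The input below is a hypothetical one-liner based on the question's sample, and a capture group is added around `\w+` so the tag names come back without the percent signs. This assumes tag names never contain anything other than letters, digits, or underscores.

```php
<?php
// Hypothetical one-line sample; note the literal percent signs in x and y.
$xml = '<line x="30%" y="50%" text="Screen One!%tag2% %tag3%"/>';

// Assumption: tag names consist only of word characters.
preg_match_all('/%(\w+)%/', $xml, $m);

$tags = $m[1];
print_r($tags);
```

The lone percent signs in `30%` and `50%` are not picked up here, because each is followed by a quote rather than a word character. The pattern does not check that a match sits inside an attribute value, though.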

Upvotes: 2

RichieHindle

Reputation: 281485

Is this:

(%[a-zA-Z0-9]+%)

not enough? In your example, tags don't appear anywhere outside of attribute values - can they?
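To make the question concrete, here is a small sketch of what the bare pattern does when a tag also shows up in element text. The input is hypothetical and not taken from the question's sample.

```php
<?php
// Hypothetical input: one tag in an attribute value, one in element text.
$xml = '<line text="Hello %tag1%">Body %tag2%</line>';

// The bare pattern has no notion of attributes, so it matches both.
preg_match_all('/(%[a-zA-Z0-9]+%)/', $xml, $m);

$found = $m[1];
print_r($found);
```

Both `%tag1%` and `%tag2%` are returned, so this pattern on its own is only enough if tags can never appear outside attribute values, or if matching them there is harmless.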

Upvotes: 2
