user782161

Reputation: 63

Repeating numbered capture groups in Perl

Imagine I'm trying to parse the following HTML using a Perl regex:

<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p>
<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p> <p>num4</p>

using the following regular expression:

<h4>([\w\s]*)</h4>(?:\s*<p>([\w\s]+)</p>)+

How would the numbered groups be structured in Perl? $1 would obviously contain the <h4> tag text, but when the capture group repeats, are the captured <p> tags then assigned to $2, $3 and $4? Is there a good way to capture all the <p> tags in an array? Is this even something Perl supports? Or am I forced to write one regex for the <h4>, then another for the <p> tags?

(I'm aware I could use HTML::Tree or something similar to parse the HTML, but this is just a simplified example to help describe the question. I'm really only interested in how repeated numbered capture groups work in Perl.)

Upvotes: 3

Views: 467

Answers (1)

melwil

Reputation: 2553

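When you repeat a capturing group, only the text matched by its last iteration is stored. The group keeps its single number ($2 here), and each repetition overwrites the previous capture, so nothing spills over into $3 or $4.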
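As a minimal sketch of that behaviour, the snippet below runs a pattern like the one in the question against the two sample lines. The helper name `last_capture` is made up for illustration, and the `\s*` between tags is an assumption added so the pattern tolerates the spaces in the sample input.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2) for a line, or an empty list if no match.
sub last_capture {
    my ($line) = @_;
    # The (?:...)+ group repeats, but the inner capture group is still just $2.
    return unless $line =~ m{<h4>([\w\s]*)</h4>(?:\s*<p>([\w\s]+)</p>)+};
    return ($1, $2);
}

my @lines = (
    '<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p>',
    '<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p> <p>num4</p>',
);

for my $line (@lines) {
    my ($heading, $last) = last_capture($line);
    print "heading=$heading last=$last\n";
}
# prints:
# heading=test last=num3
# heading=test last=num4
```

Only the final `<p>` survives in `$2`; the earlier iterations are overwritten.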

If you want every match from a repeating group, you could use a global replace with a callback function, or iterate through the matches one by one.

Most languages also have a "match all" operation, which I don't know how to do in Perl. It usually stores all matches in an array for you, but a repeated group inside each match still holds only its last iteration.
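For what it's worth, Perl's closest equivalent to "match all" is a match with the `/g` modifier in list context, which returns every capture as a list. One possible two-step sketch: first capture the whole run of `<p>` elements as a single string, then pull the individual texts out of that string. The helper name `heading_and_paras` is hypothetical, and the `\s*` between tags is again an assumption to fit the sample input.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the heading text followed by every <p> text.
sub heading_and_paras {
    my ($line) = @_;

    # Step 1: $1 is the heading, $2 is the entire run of <p>...</p> elements.
    return unless $line =~ m{<h4>([\w\s]*)</h4>((?:\s*<p>[\w\s]+</p>)+)};

    # Copy the captures before the next match overwrites $1 and $2.
    my ($heading, $run) = ($1, $2);

    # Step 2: a /g match in list context returns all captures at once.
    my @paras = $run =~ m{<p>([\w\s]+)</p>}g;

    return ($heading, @paras);
}

my ($heading, @paras) = heading_and_paras(
    '<h4>test</h4> <p>num1</p> <p>num2</p> <p>num3</p> <p>num4</p>'
);
print "heading=$heading paras=@paras\n";
# prints: heading=test paras=num1 num2 num3 num4
```

A `while ($run =~ m{<p>([\w\s]+)</p>}g) { ... }` loop would visit the same matches one at a time, which corresponds to the "iterate through the matches" option.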

Upvotes: 3
