Austin Becker

Reputation: 11

Regular expression to make distinct capture groups within a repeating pattern of HTML tags

I'd like to capture the text between <li> and </li> in its own capture group every time those HTML tags appear.

For example:

<li>packaged in a classy box</li> <li>measures 0.9x1.15" (2.3x2.8cm)</li> <li> <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" <strong> silver chain </strong> </li>

So the output I need is:

Capture Group 1: packaged in a classy box
Capture Group 2: measures 0.9x1.15" (2.3x2.8cm)
Capture Group 3: <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" <strong> silver chain </strong>

I tried lookarounds, but they match everything between the very first <li> and the very last </li>.

Upvotes: 1

Views: 104

Answers (1)

Alexandre Fenyo

Reputation: 4809

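Simply use the non-greedy (lazy) pattern .*?, which means "match as few characters as possible". A plain .* is greedy and matches as many characters as possible, which is probably why your attempt ran from the first <li> to the last </li>.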
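To make the difference concrete, here is a minimal sketch in Python rather than Perl, using the sample input from the question. The variable names are illustrative only. It also uses re.findall so that the number of <li> items does not have to be known in advance, and it strips surrounding whitespace, which the question's expected output seems to want.

```python
import re

# Sample input copied from the question.
html = (
    '<li>packaged in a classy box</li> '
    '<li>measures 0.9x1.15" (2.3x2.8cm)</li> '
    '<li> <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" '
    '<strong> silver chain </strong> </li>'
)

# Greedy: .* runs to the LAST </li>, so everything collapses into one match.
greedy = re.findall(r"<li>(.*)</li>", html)

# Lazy: .*? stops at the FIRST </li> after each <li>, giving one match per item.
# re.DOTALL lets "." cross newlines in case an item spans several lines.
lazy = [m.strip() for m in re.findall(r"<li>(.*?)</li>", html, flags=re.DOTALL)]

for i, text in enumerate(lazy, start=1):
    print(f"Capture Group {i}: {text}")
```

The greedy list holds a single string spanning all three items, while the lazy list holds one entry per <li>.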

For instance, suppose that your HTML input is saved in the file input.html. Here is a Perl one-liner that does the job for three <li> items on a single line:

perl -pe 's%<li>(.*?)</li>.*?<li>(.*?)</li>.*?<li>(.*?)</li>%capture group 1: $1\ncapture group 2: $2\ncapture group 3: $3\n%' input.html

The output is:

capture group 1: packaged in a classy box
capture group 2: measures 0.9x1.15" (2.3x2.8cm)
capture group 3:  <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" <strong> silver chain </strong>

Note that regular expressions are not a robust way to parse HTML. A better approach is to apply an XSLT style sheet to your input with an XSLT processor such as xsltproc, or to use a DOM parser. A DOM parser may be the best tool, since it can also handle some malformed HTML input.
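As a rough illustration of the parser route, the sketch below uses Python's standard-library html.parser. Note that this is an event-based parser, not a DOM parser or XSLT, so it only approximates the suggestion above. The class name ListItemCollector and its attributes are made up for this example.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the inner markup of each top-level <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []    # finished <li> contents
        self._parts = []   # pieces of the <li> currently being read
        self._depth = 0    # > 0 while inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                return     # do not include the outer <li> itself
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "li" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._parts).strip())
                self._parts = []
                return
        if self._depth:
            self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references are already decoded here (parser default).
        if self._depth:
            self._parts.append(data)


html = (
    '<li>packaged in a classy box</li> '
    '<li>measures 0.9x1.15" (2.3x2.8cm)</li> '
    '<li> <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" '
    '<strong> silver chain </strong> </li>'
)

collector = ListItemCollector()
collector.feed(html)
collector.close()
items = collector.items

for i, text in enumerate(items, start=1):
    print(f"Capture Group {i}: {text}")
```

Unlike the regex, this tracks nesting depth, so an <li> nested inside another <li> stays part of its parent item instead of ending the match early.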

Upvotes: 1
