Reputation: 11
I'd like to create a capture group for the text between <li> and </li> every time those HTML tags appear.
For example:
<li>packaged in a classy box</li> <li>measures 0.9x1.15" (2.3x2.8cm)</li> <li> <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" <strong> silver chain </strong> </li>
So the output I need is:
Capture Group 1: packaged in a classy box
Capture Group 2: measures 0.9x1.15" (2.3x2.8cm)
Capture Group 3: <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" <strong> silver chain </strong>
I tried lookarounds, but they match everything between the very first <li> and the very last </li>.
Upvotes: 1
Views: 104
Reputation: 4809
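Simply use the .*? pattern, which means "match as few characters as possible" (a non-greedy match). A plain .* is greedy and runs on to the last </li>, which is why your attempt matched everything between the first and last tags.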
For instance, suppose that your HTML input is saved in the file input.html. Here is a Perl one-liner that does the job for the three items in your example:
perl -pe 's%<li>(.*?)</li>.*?<li>(.*?)</li>.*?<li>(.*?)</li>%capture group 1: $1\ncapture group 2: $2\ncapture group 3: $3\n%' input.html
The output is:
capture group 1: packaged in a classy box
capture group 2: measures 0.9x1.15" (2.3x2.8cm)
capture group 3: <em>925 silver</em> and <em>14k white gold</em> arrive with an 20" <strong> silver chain </strong>
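The one-liner above hard-codes three items. If the number of <li> elements varies, the same non-greedy pattern can be applied repeatedly instead. Here is a minimal sketch in Python (my choice for illustration, not something the question asked for), assuming the items are not nested and the tags carry no attributes. It also shows the greedy behaviour for comparison:

```python
import re

# Sample input from the question.
html = (
    '<li>packaged in a classy box</li> '
    '<li>measures 0.9x1.15" (2.3x2.8cm)</li> '
    '<li> <em>925 silver</em> and <em>14k white gold</em> '
    'arrive with an 20" <strong> silver chain </strong> </li>'
)

# Greedy: .* runs to the last </li>, so the whole input is a single match.
greedy = re.findall(r"<li>(.*)</li>", html, flags=re.DOTALL)

# Non-greedy: .*? stops at the nearest </li>, giving one match per item.
# strip() removes the spaces just inside the third <li>.
items = [m.strip() for m in re.findall(r"<li>(.*?)</li>", html, flags=re.DOTALL)]

for n, item in enumerate(items, start=1):
    print(f"capture group {n}: {item}")
```

re.DOTALL lets . match newlines too, so items that span several lines are still captured.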
Note that a regex is not a robust way to parse HTML. A better approach is to apply an XSLT style sheet to your input with an XSLT processor such as xsltproc, or to use a DOM parser. A DOM parser may be the best tool, since it can cope with some malformed HTML input.
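To give an idea of the parser route, here is a rough sketch using Python's standard html.parser module. Note that this is an event-based parser rather than a true DOM parser, so I am swapping in a simpler tool to keep the example dependency-free. The class name ListItemCollector is made up for this example, and the sketch assumes the <li> elements are not nested:

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the inner markup of each <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished items
        self._parts = None   # pieces of the current item, None outside <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li" and self._parts is None:
            self._parts = []                      # entering an item
        elif self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            self.items.append("".join(self._parts).strip())
            self._parts = None                    # leaving the item
        elif self._parts is not None:
            self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


html = (
    '<li>packaged in a classy box</li> '
    '<li>measures 0.9x1.15" (2.3x2.8cm)</li> '
    '<li> <em>925 silver</em> and <em>14k white gold</em> '
    'arrive with an 20" <strong> silver chain </strong> </li>'
)

parser = ListItemCollector()
parser.feed(html)
parser.close()

for n, item in enumerate(parser.items, start=1):
    print(f"capture group {n}: {item}")
```

Unlike the regex, this keeps working when the <li> tags carry attributes or odd spacing. Character references such as &amp;amp; come back decoded, because the parser converts them by default.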
Upvotes: 1