dudemanbearpig

Reputation: 1312

Regex - Find the match that is inside a match

If the string is

<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>.  Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="test.pdf">

The regex I have written is

/href=.*?.pdf/

This matches from the first 'href' all the way to '.pdf', so it spans both links. I need the match to start at the second href instead. In other words, it should capture only the href that ends with .pdf.

How can I do this with a regex?

Upvotes: 0

Views: 277

Answers (2)

hek2mgl

Reputation: 158010

You should use a DOM parser rather than a regex to parse HTML or XML. In PHP, the DOMDocument class does this:

// Parse the HTML fragment into a DOM tree.
$doc = new DOMDocument();
$doc->loadHTML('<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>.  Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="http://www.police.vt.edu/VTPD_v2.1/crime_stats/crime_logs/data/VT_2011-01_Crime_Log.pdf">');

// Collect every <a> element and print its href attribute, one per line.
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
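
The loop above prints every href, while the question asks only for the one ending in .pdf. A minimal sketch of that extra filtering step follows. The helper name pdfLinks, the case-insensitive suffix check, and the silencing of libxml warnings are my assumptions and not part of the original answer; the sample string is a shortened version of the one in the question.

```php
<?php
// Hypothetical helper (not from the answer): return the href of every
// <a> element whose href ends in ".pdf", compared case-insensitively.
function pdfLinks(string $html): array
{
    $doc = new DOMDocument();

    // Assumption: collect libxml parse warnings internally instead of
    // printing them, since the input is an unclosed fragment.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $matches = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Keep only hrefs whose last four characters are ".pdf".
        if (strcasecmp(substr($href, -4), '.pdf') === 0) {
            $matches[] = $href;
        }
    }
    return $matches;
}

// Shortened version of the string from the question.
$html = '<li>See <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>'
      . ' or save a copy of <a href="test.pdf">';

print_r(pdfLinks($html));
```

Checking the attribute value after parsing means the result does not depend on attribute order, quoting style, or what text sits between the two links.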

Upvotes: 2

Jerry

Reputation: 71538

You can try this regex:

/href=[^>]+\.pdf/

regex101 demo

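Most of the time, when you can avoid .* or .+ (or their lazy versions), it's better :) Here, [^>]+ cannot run past the > that closes the first tag, so the match has to start at the second href.

Also, don't forget to escape periods: an unescaped . matches any character, so .pdf would also match something like "xpdf".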
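
To see the difference between the two patterns, here is a small sketch that runs both against a shortened version of the string from the question using PHP's preg_match. The third pattern, with a capture group around the URL, is my own variation and not part of this answer; it assumes the attribute value is double-quoted.

```php
<?php
// Shortened version of the string from the question.
$html = '<li>See <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>'
      . ' or save a copy of <a href="test.pdf">';

// Lazy pattern from the question: the match starts at the first "href="
// and expands across the first tag until it reaches "pdf".
preg_match('/href=.*?.pdf/', $html, $lazy);

// Negated class from this answer: [^>]+ cannot cross the ">" that closes
// the first tag, so the attempt at the first href fails and the match
// starts at the second one.
preg_match('/href=[^>]+\.pdf/', $html, $bounded);

// Assumed variation: capture only the URL inside the double quotes.
preg_match('/href="([^"]+\.pdf)"/', $html, $captured);

echo $lazy[0], PHP_EOL;      // spans both links
echo $bounded[0], PHP_EOL;   // href="test.pdf
echo $captured[1], PHP_EOL;  // test.pdf
```

Note that the answer's pattern still includes the leading href=" in the match; a capture group is one way to get the bare URL.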

Upvotes: 2
