Reputation: 1312
If the string is
<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>. Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="test.pdf">
The regex I have written is
/href=.*?.pdf/
This matches from the first 'href' all the way through to '.pdf'. I need the match to start at the second href instead. In other words, it should capture only the href that ends with .pdf.
How should I go about this using regex?
Upvotes: 0
Views: 277
Reputation: 158010
You should use a DOM parser rather than a regex to parse HTML or XML. In PHP there is the DOMDocument class for this:
$doc = new DOMDocument();
$doc->loadHTML('<li>Your browser may be missing a required plug-in contained in <a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>. Please reload this page after installing the missing component.<br />If this error persists, you can also save a copy of <a href="http://www.police.vt.edu/VTPD_v2.1/crime_stats/crime_logs/data/VT_2011-01_Crime_Log.pdf">');
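// Collect every <a> element, then keep only the links whose href ends in ".pdf".
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    $href = $link->getAttribute('href');
    if (preg_match('/\.pdf$/i', $href)) {
        echo $href, "\n";
    }
}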
Upvotes: 2
Reputation: 71538
You can try this regex:
/href=[^>]+\.pdf/
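Most of the time, when you can avoid .* or .+ (or their lazy versions), it's better :) Here [^>]+ cannot run past the end of a tag, so a match cannot start at the first href and stretch to the later .pdf.
Also, don't forget to escape periods: an unescaped . matches any character, not just a literal dot.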
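To see the difference, here is a minimal sketch that runs both patterns against an abbreviated copy of the string from the question. The variable names and the third pattern, which captures only the URL between the quotes, are my own additions for illustration and not part of the original question or answer.

```php
<?php
// Abbreviated copy of the sample string from the question
// (the second link points at "test.pdf").
$html = '<li>Your browser may be missing a required plug-in contained in '
      . '<a href="http://get.adobe.com/reader/">Adobe Acrobat Reader</a>. '
      . 'If this error persists, you can also save a copy of <a href="test.pdf">';

// Lazy ".*?" still starts at the FIRST "href=" and grows until ".pdf" shows up.
preg_match('/href=.*?.pdf/', $html, $lazy);

// "[^>]+" cannot cross a ">", so the attempt at the first href fails
// and the engine moves on to the second one.
preg_match('/href=[^>]+\.pdf/', $html, $bounded);

// Hypothetical variant: capture only the URL inside the double quotes.
preg_match('/href="([^">]+\.pdf)"/', $html, $urlOnly);

echo $lazy[0], "\n";     // starts at href="http://get.adobe.com/...
echo $bounded[0], "\n";  // href="test.pdf
echo $urlOnly[1], "\n";  // test.pdf
```

The capturing variant assumes the attribute value is wrapped in double quotes; single-quoted or unquoted attributes would need a different pattern, which is one more reason the DOM approach is more robust.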
Upvotes: 2