Reputation: 33
Basically, I'm attempting to use preg_match to find all links to a PDF attachment and add each entire URL to an array. The part I'm struggling with is how to select everything before the match, up to the quotes of the <a href="">. I want to do this so that I can loop through the array and do whatever I need to with each document. I just want to end up with '1234.pdf' (plus any subdirectory info) in the array.
Any ideas?
This is what I have so far; it only returns the match itself:
$string1 = "<a href='1234.pdf'>Document 1</a>";
$match = preg_match("/.pdf/i", $string1, $output);
Thanks
Upvotes: 3
Views: 3062
Reputation: 169211
You should really use a proper HTML parser (see netcoder's answer) and apply an XPath expression to solve this. If you are bound and determined to use a regex, try something like this:
$match = preg_match_all("/(?<=href=['\"])([^'\"]*\\.pdf[^'\"]*)(?=['\"])/",
$string1, $output);
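To make the shape of the result concrete, here is a minimal sketch that runs the same pattern over a few links. The sample markup and the variable names `$html`, `$pattern`, and `$count` are assumptions for illustration, not part of the question. The pattern is written in a single-quoted string here, which yields the same regex with less escaping.

```php
<?php
// Hypothetical sample markup: two PDF links (one per quote style) and one non-PDF link.
$html = "<a href='docs/1234.pdf'>Document 1</a>"
      . '<a href="photo.jpg">Photo</a>'
      . '<a href="reports/q3.pdf">Q3</a>';

// Lookbehind anchors the match right after href=' or href=",
// the capture group takes everything up to the closing quote,
// and the lookahead requires that closing quote without consuming it.
$pattern = '/(?<=href=[\'"])([^\'"]*\.pdf[^\'"]*)(?=[\'"])/';

// preg_match_all returns the number of full matches found.
$count = preg_match_all($pattern, $html, $output);

// Lookarounds consume nothing, so $output[0] and $output[1] hold the same strings.
print_r($output[1]);
```

With this input, `$output[1]` should contain `docs/1234.pdf` and `reports/q3.pdf`. Note that the pattern has no `i` modifier, so a link ending in `.PDF` would be skipped unless you add one.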
Upvotes: 1
Reputation: 39
If I understand you correctly, it sounds like you need to use sub-patterns. Try something like this:
$match = preg_match("/href=[\"']([^\"']*\.pdf)[\"']/i", $string1, $output);
The $output variable should be an array with index 0 containing the full matched text and index 1 containing the text matched between the parentheses.
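As a rough illustration of how index 0 and index 1 differ, the sketch below applies a sub-pattern to the sample string from the question. The pattern used here accepts either quote style so that it matches the single-quoted sample; `$pattern` is a name introduced only for this example.

```php
<?php
// Sample string taken from the question.
$string1 = "<a href='1234.pdf'>Document 1</a>";

// The parentheses form the sub-pattern: only the URL between the quotes is captured.
// [^"\']* stops at the first quote, so the match cannot run past the attribute value.
$pattern = '/href=["\']([^"\']*\.pdf)["\']/i';

// preg_match stops at the first match and returns 1 on success, 0 on no match.
$match = preg_match($pattern, $string1, $output);

echo $output[0], "\n"; // full matched text, including href= and the quotes
echo $output[1], "\n"; // only the text captured by the parentheses
```

Here `$output[0]` should be `href='1234.pdf'` and `$output[1]` should be `1234.pdf`. Since preg_match only reports the first link, switching to preg_match_all is the natural next step when a page contains several.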
Upvotes: 0
Reputation: 67735
You should use a DOM parser to extract that information, because it's easier and safer. Then you can use preg_match to check whether each link is actually a PDF:
$html = '<a href="foo.pdf">Foo</a>'.
'<a href="bar.jpg">Bar</a>'.
'<a href="baz.pdf">Baz</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
    $href = $link->getAttribute('href');
    if (preg_match('/\.pdf$/i', $href)) {
        $result[] = $href;
    }
}
print_r($result);
Outputs:
Array
(
[0] => foo.pdf
[1] => baz.pdf
)
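The XPath route suggested in another answer can be sketched as follows, as one possible variant rather than a definitive implementation. The sample markup (including the upper-case `.PDF` link) and the `$query` expression are assumptions for illustration. XPath 1.0, which DOMXPath implements, has no ends-with() or lower-case() function, so the sketch emulates them with substring() and translate().

```php
<?php
// Hypothetical sample markup; the last link uses an upper-case extension.
$html = '<a href="foo.pdf">Foo</a>'
      . '<a href="bar.jpg">Bar</a>'
      . '<a href="docs/baz.PDF">Baz</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// translate() lower-cases the letters P, D and F for the comparison only;
// substring(..., string-length(@href) - 3) takes the last four characters.
$query = "//a[substring(translate(@href, 'PDF', 'pdf'), "
       . "string-length(@href) - 3) = '.pdf']/@href";

$result = array();
foreach ($xpath->query($query) as $attr) {
    // Each node is the href attribute itself; nodeValue is its original text.
    $result[] = $attr->nodeValue;
}
print_r($result);
```

With this input, `$result` should hold `foo.pdf` and `docs/baz.PDF`. Like the `/\.pdf$/i` check above, this only tests the end of the attribute value, so a URL with a query string after `.pdf` would not be picked up.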
Upvotes: 5