shane

Reputation: 33

php preg_match. Add to array

Basically, I'm trying to use preg_match to find all links to a PDF attachment and add each full URL to an array. The part I'm struggling with is how to select everything before the match, back to the opening quote of the <a href="">. I want to do this so that I can loop through the array and do whatever I need to with each document. I just want to end up with '1234.pdf' (plus any subdirectory info) in the array.

Any ideas?

This is what I have so far. It only returns the matched text itself ('.pdf'), not the rest of the URL:

$string1 = "<a href='1234.pdf'>Document 1</a>";

$match = preg_match("/.pdf/i", $string1, $output);

Thanks

Upvotes: 3

Views: 3062

Answers (3)

cdhowie

Reputation: 169211

You should really use a proper HTML parser (see netcoder's answer) and apply an XPath expression to solve this. If you are bound and determined to use a regex, try something like this:

$match = preg_match_all("/(?<=href=['\"])([^'\"]*\\.pdf[^'\"]*)(?=['\"])/",
                        $string1, $output);
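As a rough sketch of the XPath route mentioned above, the following uses PHP's DOMDocument and DOMXPath. The sample markup and the variable names ($html, $pdfs) are assumptions made for illustration. XPath 1.0 has no ends-with() function, so the query compares the last four characters of the href instead; note that this comparison is case-sensitive, so it would miss '.PDF'.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$html = '<a href="docs/1234.pdf">Document 1</a>'
      . '<a href="photo.jpg">Photo</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the href attribute of every <a> whose href ends in ".pdf".
// substring(@href, string-length(@href) - 3) yields the last 4 characters.
$query = "//a[substring(@href, string-length(@href) - 3) = '.pdf']/@href";

$pdfs = array();
foreach ($xpath->query($query) as $attr) {
    $pdfs[] = $attr->value; // the attribute's string value
}

print_r($pdfs);
```

Each entry in $pdfs keeps any subdirectory part of the URL, so it can be looped over directly.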

Upvotes: 1

dt1021

Reputation: 39

If I understand you correctly, it sounds like you need to use sub-patterns. Try something like this:

$match = preg_match("/href=['\"]([^'\"]*\.pdf)['\"]/i", $string1, $output);

The $output variable should be an array with index 0 containing the full matched text and index 1 containing the text matched between the parentheses.
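To collect every PDF link rather than just the first, the same sub-pattern idea can be used with preg_match_all. This is only a sketch: the sample string is an assumption that extends the question's $string1, and the pattern is written to accept either single or double quotes around the href value.

```php
<?php
// Hypothetical input: the question's link plus two more for illustration.
$string1 = "<a href='1234.pdf'>Document 1</a> "
         . "<a href=\"docs/5678.pdf\">Document 2</a> "
         . "<a href='logo.png'>Logo</a>";

// Group 1 captures everything between the quotes, as long as it ends in .pdf.
// [^'\"]* cannot cross a quote, so each match stays inside one attribute.
preg_match_all("/href=['\"]([^'\"]*\.pdf)['\"]/i", $string1, $output);

// $output[0] holds the full matches (including href= and the quotes);
// $output[1] holds just the captured URLs.
$pdfs = $output[1];

print_r($pdfs);
```

With preg_match_all, $output[1] is itself an array holding one captured URL per matching link, which is the array the question asks for.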

Upvotes: 0

netcoder

Reputation: 67735

You should use a DOM parser to extract that information, because it's easier and safer. Then you can use preg_match to check whether each link actually points to a PDF:

$html = '<a href="foo.pdf">Foo</a>'.
        '<a href="bar.jpg">Bar</a>'.
        '<a href="baz.pdf">Baz</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');

$result = array();
foreach ($links as $link) {
   $href = $link->getAttribute('href');
   if (preg_match('/\.pdf$/i', $href)) $result[] = $href;
}

print_r($result);

Outputs:

Array
(
    [0] => foo.pdf
    [1] => baz.pdf
)

Upvotes: 5
