Andreas Hunter

Reputation: 5024

With regex, how can I match a specific domain name inside an HTML document?

For example, I have this custom HTML document:

<html>
<head>
    <title>Urls</title>
</head>
<body>
    <a href="https://www.google.com">Google</a>
    <a href="https://facebook.com">Facebook</a>
    <a href="http://www.example.com">Example</a>

    <p>Duis aute irure dolor in reprehenderit in voluptate velit esse
    cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
    proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>

    <h1>Heading</h1>

    <a href="www.example.com">Example</a>
</body>
</html>

How can I extract from the document the URLs whose domain name contains the string example.com?

For example, I have this regex, <a.+?\s*href\s*=\s*["\']?([^"\'\s>]+)["\']?, which finds all URLs in href attributes. But how can I use regex to find a specific URL?

Upvotes: 1

Views: 108

Answers (1)

mickmackusa

Reputation: 48041

To reliably extract every href value in the HTML document that contains www.example.com, I would use a combination of DOMDocument, XPath, and strpos().

XPath allows you to specifically target all href attributes in the document, whichever element they belong to.

I am electing to trim the query string from the href values before searching, so that a match inside a query string does not count. I could not rely on parse_url() (though I would have preferred it) because your href URLs are not always complete: some have no scheme.
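To illustrate the parse_url() limitation, here is a minimal sketch using two href values from the sample document. The variable names $hrefs and $hosts are only for illustration.

```php
<?php
// Two href values taken from the sample document in the question.
$hrefs = ['http://www.example.com', 'www.example.com'];

$hosts = [];
foreach ($hrefs as $href) {
    // Without a scheme, parse_url() treats the whole string as a path,
    // so there is no host component and null is returned for it.
    $hosts[$href] = parse_url($href, PHP_URL_HOST);
}

var_export($hosts);
```

The first entry yields the host www.example.com, while the scheme-less second entry yields null, which is why a plain host comparison would miss it.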

Code:

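// $html is assumed to hold the HTML document from the question
$dom = new DOMDocument();
libxml_use_internal_errors(true);  // silence warnings about imperfect markup
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
// "//@href" selects every href attribute, regardless of its element
foreach ($xpath->query("//@href") as $href) {
    // keep only the part before the first "?" so query strings cannot match
    $noQueryString = explode('?', $href->nodeValue, 2)[0];
    if (strpos($noQueryString, 'www.example.com') !== false) {
        $result[] = $href->nodeValue;  // store the original, untrimmed value
    }
}
var_export($result);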

Output:

array (
  0 => 'http://www.example.com',
  1 => 'www.example.com',
)
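If a substring test turns out to be too loose (strpos() would also accept www.example.com appearing in a path, for instance), one possible alternative is to compare hosts. The sketch below is an assumption on my part, not part of the code above: hrefHost() and isOnDomain() are hypothetical helpers, and the "//" prefix is a workaround that lets parse_url() see a host in scheme-less values. It does not handle every kind of href (mailto: links, for example).

```php
<?php
// Hypothetical helper: returns the lowercased host of an href, or null.
function hrefHost(string $href): ?string
{
    // Prefix values that lack "scheme://" or "//" so parse_url() finds a host.
    if (!preg_match('~^(?:[a-z][a-z0-9+.-]*:)?//~i', $href)) {
        $href = '//' . $href;
    }
    $host = parse_url($href, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

// Hypothetical helper: true if $host is $domain itself or a subdomain of it.
function isOnDomain(?string $host, string $domain): bool
{
    if ($host === null) {
        return false;
    }
    return $host === $domain
        || substr($host, -strlen($domain) - 1) === '.' . $domain;
}

// The four href values from the sample document.
$hrefs = [
    'https://www.google.com',
    'https://facebook.com',
    'http://www.example.com',
    'www.example.com',
];

$matches = [];
foreach ($hrefs as $href) {
    if (isOnDomain(hrefHost($href), 'example.com')) {
        $matches[] = $href;
    }
}
var_export($matches);
```

In a real script, the $hrefs array would be filled from the //@href XPath query instead of being hard-coded.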

Upvotes: 1
