With regex, how can I match a specific domain name inside an HTML document?

Question

For example, I have a custom HTML document:



    Urls


    Google
    Facebook
    Example

    Duis aute irure dolor in reprehenderit in voluptate velit esse
    cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
    proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    Heading

    Example

How can I extract from the document the domain names that contain the string example.com?

For example, I have this regex, ]+)["\']?, which finds all URLs in href attributes. But how can I use regex to find a specific URL?

mickmackusa · Accepted Answer

To reliably extract the href values that contain www.example.com from all elements in the HTML document, I would use a combination of DOMDocument, XPath, and strpos().

XPath lets you target every href attribute in the document directly, regardless of which element carries it.
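As a rough illustration of how XPath can target href attributes, the sketch below filters them inside the query itself with contains(). The sample markup is hypothetical (the question's original HTML is not reproduced here), and this variant does not trim the query string, so it would also match a link that merely mentions the domain in a parameter.

```php
<?php
// Hypothetical sample markup, assumed for illustration only.
$html = '<p><a href="http://www.google.com">Google</a> '
      . '<a href="http://www.example.com">Example</a> '
      . '<a href="www.example.com">Example</a></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select only href attributes whose value contains the target string.
$matches = [];
foreach ($xpath->query("//@href[contains(., 'www.example.com')]") as $href) {
    $matches[] = $href->nodeValue;
}
```

Filtering in PHP instead, as the answer's code does, leaves room to clean each value before testing it.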

I am electing to trim the querystring from the href values for improved accuracy. I could not rely on parse_url() (though I would have preferred it) because your href urls are not always complete.
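To see why parse_url() falls short here, a minimal sketch with made-up href values (assumptions, not taken from the question) may help: without a scheme, parse_url() treats the whole string as a path and reports no host, while splitting on the question mark works either way.

```php
<?php
// Illustrative values only; these hrefs are assumptions.
$complete   = 'http://www.example.com/page?x=1';
$incomplete = 'www.example.com/page?x=1';

// With a scheme, the host component is found.
$hostComplete = parse_url($complete, PHP_URL_HOST);

// Without a scheme, there is no host component, so null is returned.
$hostIncomplete = parse_url($incomplete, PHP_URL_HOST);

// Trimming the query string does not depend on the URL being complete.
$trimmed = explode('?', $incomplete, 2)[0];
```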

Code: (Demo)

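// $html is assumed to hold the HTML document from the question.
$dom = new DOMDocument();
libxml_use_internal_errors(true);  // suppress warnings from imperfect markup
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
foreach ($xpath->query("//@href") as $href) {  // every href attribute in the document
    // Drop the query string so a domain that appears only in a parameter is ignored
    $noQueryString = explode('?', $href->nodeValue, 2)[0];
    if (strpos($noQueryString, 'www.example.com') !== false) {
        $result[] = $href->nodeValue;  // keep the original, untrimmed value
    }
}
var_export($result);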

Output:

array (
  0 => 'http://www.example.com',
  1 => 'www.example.com',
)
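If a regex is still wanted for the final check, as the question asks, one possible sketch is to replace the strpos() test with an anchored pattern. The helper name hrefMatchesDomain is hypothetical and not part of the answer above; it assumes the href starts with an optional scheme, an optional "//", and then the host, and it accepts the domain itself or any subdomain of it.

```php
<?php
// Hypothetical helper (not from the original answer): true when the href's
// host part is $domain or a subdomain of it.
function hrefMatchesDomain(string $href, string $domain): bool
{
    $pattern = '~^(?:[a-z][a-z0-9+.-]*:)?'      // optional scheme such as "http:"
             . '(?://)?'                        // optional "//"
             . '(?:[a-z0-9-]+\.)*'              // optional subdomain labels
             . preg_quote($domain, '~')         // the domain, escaped literally
             . '(?=[/:?#]|$)~i';                // host must end here
    return preg_match($pattern, $href) === 1;
}

$hrefs = [
    'http://www.example.com',
    'www.example.com',
    'http://www.google.com/?q=www.example.com',
    'http://example.com.evil.org',
];

// array_values() reindexes the filtered list from 0.
$filtered = array_values(array_filter(
    $hrefs,
    fn (string $href): bool => hrefMatchesDomain($href, 'example.com')
));
```

Because the pattern is anchored at the start of the value, a domain that appears only later in the path or query string is not matched, which serves the same purpose as trimming the query string.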
