Reputation: 5024
I have, for example, this custom HTML document:
<html>
<head>
<title>Urls</title>
</head>
<body>
<a href="https://www.google.com">Google</a>
<a href="https://facebook.com">Facebook</a>
<a href="http://www.example.com">Example</a>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
<h1>Heading</h1>
<a href="www.example.com">Example</a>
</body>
</html>
How can I extract from the document only the URLs whose domain names contain the string example.com?
For example, I have this regex <a.+?\s*href\s*=\s*["\']?([^"\'\s>]+)["\']?
which finds all URLs in href attributes. But how do I use a regex to find a specific URL?
Upvotes: 1
Views: 108
Reputation: 48041
To reliably extract the href values from all elements in the HTML document that contain www.example.com, I would use a combination of DOMDocument, XPath, and strpos().
XPath allows you to specifically target all href values in the document.
I am electing to trim the query string from the href values for improved accuracy. I could not rely on parse_url() (though I would have preferred it) because your href URLs are not always complete.
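To illustrate that parse_url() caveat: the function only populates the 'host' component when the URL includes a scheme, so a bare href like "www.example.com" is parsed as a path and host-based matching would miss it. A minimal demonstration (not part of the original answer):

```php
<?php
// With a scheme, parse_url() identifies the host correctly.
var_dump(parse_url('http://www.example.com')['host'] ?? null); // string "www.example.com"

// Without a scheme, the same string is treated as a path -- no 'host' key at all.
var_dump(parse_url('www.example.com')['host'] ?? null);        // NULL
```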
Code: (Demo)
// $html contains the HTML document from the question
$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from imperfect markup
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//@href') as $href) {
    // discard the querystring (everything from the first "?") before matching
    $noQueryString = explode('?', $href->nodeValue, 2)[0];
    if (strpos($noQueryString, 'www.example.com') !== false) {
        $result[] = $href->nodeValue;
    }
}
var_export($result);
Output:
array (
  0 => 'http://www.example.com',
  1 => 'www.example.com',
)
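As an aside, if trimming the querystring were not required, the substring check could be pushed into the XPath expression itself with contains(). A sketch under that assumption, using an inline $html sample rather than your full document:

```php
<?php
// Sketch: filter href attributes directly in XPath, no strpos() needed.
$html = '<a href="http://www.example.com">Example</a>'
      . '<a href="https://www.google.com">Google</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
// contains(., "...") tests the attribute's own string value.
foreach ($xpath->query('//@href[contains(., "www.example.com")]') as $href) {
    $result[] = $href->nodeValue;
}
var_export($result); // only the example.com href is collected
```

The trade-off is that contains() matches anywhere in the raw value, including inside a querystring, which is exactly why the answer above trims the querystring first.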
Upvotes: 1