Reputation: 3
I'm new to stackoverflow and from South Korea.
I'm having difficulties with regex with php.
I want to select all the urls from user submitted html source.
The restrictions I want to make are following.
Select urls EXCEPT
urls are within tags for example if the html source is like below,
<a href="http://aaa.com">http://aaa.com</a>
Neither of http://aaa.com
should be selected.
urls right after " or =
Here is my current regex stage.
/(?<![\"=])https?\:\/\/[^\"\s<>]+/i
but with this regex, I can't achieve the first rule.
I tried to add negative lookahead at the end of my current regex like
/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i
It still chooses the second url in the a tag like below.
http://aaa.co
We don't have developers Q&A community like Stackoverflow in Korea, so I really hope someone can help this simplely looking regex issue!
Upvotes: 0
Views: 57
Reputation: 76405
Seeing as you're not trying to extract urls that are the href
attribute of an a
node, you'll want to start by getting the actual text content of the dom. This can be easily done like so:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')[0];//get outer tag, in case of a full dom, this is body
$text = $root->textContent;//no tags, no attributes, no nothing.
An alternative approach would be this:
$text = strip_tags($htmlString);//gets rid of makrup.
Upvotes: 0
Reputation: 10563
Don't use Regex. Use DOM
$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
if($a->hasAttribute('href')){
echo $a->getAttribute('href');
}
//$a->nodeValue; // If you want the text in <a> tag
}
Upvotes: 1
Reputation: 593
Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error prone, more work, and slower.
The DOM works just like in the browser and you can use getElementsByTagName
to get all links.
I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):
<?php
$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
var_dump($link->getAttribute('href'));
// Output: http://aaa.com
}
Upvotes: 1