user3859822

Reputation: 3

PHP regex: selecting URLs from HTML source

I'm new to Stack Overflow and I'm from South Korea.

I'm having difficulty with a regex in PHP.

I want to select all the URLs from user-submitted HTML source.

The restrictions I want to apply are the following.

Select urls EXCEPT

Here is my current regex:

/(?<![\"=])https?\:\/\/[^\"\s<>]+/i

But with this regex, I can't satisfy the first rule.

I tried adding a negative lookahead at the end of my current regex, like this:

/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i

It still matches the second URL, the one inside the a tag, as shown below.

http://aaa.co
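
A minimal reproduction of the behaviour described above, assuming the sample anchor used in the answers below as input (`$html` and `$pattern` are illustrative names). It suggests that the lookahead does not reject the link text; it only forces the engine to backtrack by one character, so the match loses its last letter:

```php
<?php
// Assumed sample input: one anchor whose text is the same URL as its href.
$html = '<a href="http://aaa.com">http://aaa.com</a>';

// The pattern with the negative lookahead appended.
$pattern = '/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i';

preg_match_all($pattern, $html, $matches);

// The href value is skipped by the lookbehind, but the link text still
// matches: the greedy character class gives back the final "m" so that
// the text following the match is "m</a>" rather than "</a>".
print_r($matches[0]); // Array ( [0] => http://aaa.co )
```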

We don't have a developer Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!

Upvotes: 0

Views: 57

Answers (3)

Elias Van Ootegem

Reputation: 76405

Seeing as you're not trying to extract URLs that are the href attribute of an a node, you'll want to start by getting the actual text content of the DOM. This can be done like so:

$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')->item(0); // outer tag; for a full document this is body
$text = $root->textContent; // plain text only: no tags, no attributes

An alternative approach would be this:

$text = strip_tags($htmlString); // gets rid of the markup
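
As a rough sketch of how the extracted text could then be searched, the following runs a simplified URL pattern over the output of strip_tags. The sample markup, the variable names and the pattern itself are assumptions made for illustration, not part of the original question:

```php
<?php
// Hypothetical input: one bare URL in the text and one URL that exists only in an href.
$htmlString = '<p>See http://example.com/docs and <a href="http://aaa.com">a link</a></p>';

// Drop all tags and attributes, keeping only the visible text.
$text = strip_tags($htmlString);

// Match http/https URLs in the remaining plain text (simplified pattern).
preg_match_all('~https?://[^\s<>"]+~i', $text, $matches);
$urls = $matches[0];

print_r($urls); // Array ( [0] => http://example.com/docs )
```

Note that the text inside an a tag survives strip_tags, so a URL used as link text would still be matched by this approach.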

Upvotes: 0

Linga

Reputation: 10563

Don't use regex. Use the DOM:

$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
    if($a->hasAttribute('href')){
        echo $a->getAttribute('href');
    }
    //$a->nodeValue; // If you want the text in <a> tag
}

Upvotes: 1

chh

Reputation: 593

Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error-prone, more work, and slower.

The DOM works just like in the browser and you can use getElementsByTagName to get all links.

I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):

<?php

$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $link) {
    var_dump($link->getAttribute('href'));
    // Output: string(14) "http://aaa.com"
}
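
If the goal is to collect URLs that appear in the text but are not already inside an a tag (an assumption based on the lookahead attempt in the question), one possible sketch combines the DOM with XPath and a simplified URL pattern. The sample markup and variable names here are hypothetical:

```php
<?php
// Hypothetical input: one bare URL in the text and one URL already wrapped in a link.
$html = '<p>Visit http://bbb.com today. <a href="http://aaa.com">http://aaa.com</a></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // user-submitted HTML is often invalid; collect warnings silently
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$urls = [];

// Visit only text nodes that have no <a> ancestor, so link text is never examined.
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    if (preg_match_all('~https?://[^\s<>"]+~i', $textNode->nodeValue, $m)) {
        $urls = array_merge($urls, $m[0]);
    }
}

print_r($urls); // Array ( [0] => http://bbb.com )
```

Because attributes are never part of a text node, href values are skipped without any lookbehind in the pattern.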

Upvotes: 1
