user1032289

Reputation: 502

Parse external webpage and extract all URLs and link text from the content

I want to parse external webpages and extract all URLs and their link text from the content using PHP.

For example,

$content = '<a href="http://google.com" target="_blank"> google</a> is very good search engine <a href="http://gmail.com" target="_blank">Gmail </a> is provided by google.';

Output:

http://google.com      google 
http://gmail.com     Gmail 

Suggestions are much appreciated!

Upvotes: 0

Views: 1847

Answers (2)

fardjad

Reputation: 20424

If you want to extract the URL and text using regular expressions, the following pattern should work (the quantifiers are non-greedy so each match stops at its own closing tag):

<a\s+[^>]*href\s*=\s*"(?<url>[^"]*)"[^>]*>(?<text>.*?)</a>

However, parsing HTML with regular expressions is not a good idea; you can use the DOM extension instead.

Edit

$content = '<a href="http://google.com" target="_blank"> google</a> is very good search engine <a href="http://gmail.com" target="_blank">Gmail </a> is provided by google.';

$html = new DOMDocument();
$html->loadHTML($content);

$anchors = $html->getElementsByTagName('a');
foreach ($anchors as $anchor) {
    echo $anchor->getAttribute('href') . "\t" . $anchor->nodeValue . "\n";
}
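If you want to keep the pairs around instead of printing them, the same DOM loop can fill an array. A minimal sketch (the `extractLinks` helper name is mine, not part of the question; the `libxml` calls suppress warnings about imperfect real-world markup):

```php
<?php
// Collect every (href, text) pair from an HTML string using DOMDocument.
// extractLinks is a hypothetical helper name for illustration.
function extractLinks($html)
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; keep loadHTML quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = array(
            'url'  => $anchor->getAttribute('href'),
            'text' => trim($anchor->nodeValue),
        );
    }
    return $links;
}

$content = '<a href="http://google.com" target="_blank"> google</a> is very good '
         . 'search engine <a href="http://gmail.com" target="_blank">Gmail </a> '
         . 'is provided by google.';
print_r(extractLinks($content));
```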

Upvotes: 2

Eray

Reputation: 7128

You can use the REGEX pattern href="([a-zA-Z0-9://. ]+)" (wrapped in PCRE delimiters, since preg functions require them).

Example usage

$pattern = '~href="([a-zA-Z0-9://. ]+)"~';
$content = file_get_contents('FILE NAME HERE');
preg_match_all($pattern, $content, $matches);

print_r($matches);

This will list all links, and then you can parse them.
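To also get the link text the question asks for, the pattern can be extended with a second capture group. A quick sketch (as the other answer notes, regex on HTML is fragile, so the DOM approach is more robust):

```php
<?php
// Capture both the URL and the link text with one pattern.
$content = '<a href="http://google.com" target="_blank"> google</a> is very good '
         . 'search engine <a href="http://gmail.com" target="_blank">Gmail </a> '
         . 'is provided by google.';

// Group 1: href value, group 2: anchor text (non-greedy, case-insensitive).
$pattern = '~<a\s+[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>~i';
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1] . "\t" . trim($m[2]) . "\n";
}
```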

Upvotes: 0
