Reputation: 21
PHP masters,
Here's code for grabbing links from Google.
<?php
# Use the Curl extension to query Google and get back a page of results
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    echo $link->getAttribute('href');
    echo "<br />";
}
?>
It echoes results sort of like this:
https://www.google.com/webhp?tab=ww
http://www.google.com/imghp?hl=bn&tab=wi
Now, I'm still a learner and need your help. I'd like to adapt the above code so that, using the DOM, it can extract all URLs and their anchor texts from every link on a chosen webpage, no matter what format the links are in. Formats such as:
<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
Each anchor text should sit underneath its extracted URL, with a blank line between listed items, like this (and so on for every link found):
http://stackoverflow.com<br>
A programmer's forum<br>
<br>
http://google.com<br>
A searchengine<br>
<br>
http://yahoo.com<br>
An Index<br>
<br>
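To illustrate, something roughly like this is what I picture for the DOM version (an untested sketch adapted from my code above; the URL here is just a placeholder, not the page I actually want to scrape):
<?php
# Fetch a page with cURL (placeholder URL)
$url = "http://example.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$html = curl_exec($ch);
curl_close($ch);
# Parse it and list every link: URL, anchor text, blank line
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href'), "<br>";
    echo trim($link->nodeValue), "<br>";
    echo "<br>";
}
?>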
I'd also appreciate a cURL version (not using the DOM) from you fine folks that produces the same result. This cURL attempt did not work quite the way I wanted:
<?php
$curl = curl_init('http://stackoverflow.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if(curl_errno($curl)) // check for execution errors
{
echo 'Scraper error: ' . curl_error($curl);
exit;
}
curl_close($curl);
$regex = '/<\s*a\s+[^>]*href\s*=\s*["\']?([^"\' >]+)["\' >]/i';
if ( preg_match($regex, $page, $list) )
echo $list[0];
else
print "Not found";
?>
Any chance this can be achieved with cURL (not using the DOM) and without regex? I'd like to see both a regex sample and one without regex. Finally, I really don't want to be using limited functions such as file_get_contents() and the like.
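For the regex route, something roughly like this is what I have in mind: capture both the href and the anchor text in one pass (just my untested guess at a pattern, and I'm sure it misses edge cases):
<?php
# Untested sketch: grab href and anchor text from each <a> tag with a regex
$curl = curl_init('http://stackoverflow.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
curl_close($curl);
# Group 1 = the URL inside the quotes, group 2 = the anchor text
$regex = '/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/is';
if ( preg_match_all($regex, $page, $matches, PREG_SET_ORDER) ) {
    foreach($matches as $m) {
        echo $m[1], "<br>";             // the URL
        echo strip_tags($m[2]), "<br>"; // the anchor text
        echo "<br>";
    }
} else {
    print "Not found";
}
?>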
Thank you!
1st EDIT: This isn't working:
<?php
# Use the Curl extension to query Google and get back a page of results
$url = "http://fiverr.com/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Fiverr.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <a> tags
foreach($dom->getElementsByTagName('a') as $link) {
# Show the <a href>
echo $link->getAttribute('href');
echo "<br />";
echo $link->nodeValue;
}
?>
I see a completely blank white page. No output at all.
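(I'm guessing the blank page means PHP errors are being hidden, so I assume adding something like this at the top of the script would surface whatever is going wrong:)
<?php
# Show every error and warning while debugging
ini_set('display_errors', 1);
error_reporting(E_ALL);
?>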
2nd EDIT: I updated the script and see these errors:
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 97 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag svg invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag path invalid in Entity, line: 119 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 123 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 184 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag g invalid in Entity, line: 352 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID display-name already defined in Entity, line: 895 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): ID m-address already defined in Entity, line: 899 in C:\xampp\htdocs\cURL\crawler.php on line 194
Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 1168 in C:\xampp\htdocs\cURL\crawler.php on line 194
...and dozens more of the same "Tag svg/g/path invalid" and "htmlParseEntityRef" warnings, all pointing at line 194 of crawler.php.
The update:
<?php
/*
Using PHP's DOM functions to
fetch hyperlinks and their anchor text
*/
$dom = new DOMDocument;
$dom->loadHTML(file_get_contents('https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl'));
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
$anchor = $link->nodeValue;
echo $href,"\t",$anchor,"\n";
}
echo '</pre>';
?>
3rd EDIT: OK, Luis Munoz's sample has worked for me so far. But his sample, as well as my original one, does not make the crawler follow the links it finds on the fetched page. I'd now like to extend the script so the crawler does follow those links. Below are my two attempts, done in two different ways, at building a simple link-following crawler. What I'm trying to do is learn to build a simple web crawler that follows links and extracts the links it finds on each newly fetched page.
STEPS: First, I feed it a starting URL. It fetches that page, extracts all the links into a single array, and echoes them, so on each page load you only see the extracted links. Then it fetches each of those linked pages, extracts all of their links into the array, and echoes them in the same way. It keeps doing this until it reaches the maximum link depth that has been set.
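Roughly, this is the single-function shape I picture for those steps (just an untested outline I sketched to clarify my thinking; the function name and starting URL are placeholders, not code I have verified):
<?php
# Untested outline of a depth-limited crawler, procedural style
libxml_use_internal_errors(true); // hide broken-HTML warnings

function crawl($url, $depth, $max_depth)
{
    if($depth >= $max_depth)
    {
        return; // link crawling depth level reached
    }
    // Fetch the page with cURL
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    $html = curl_exec($curl);
    curl_close($curl);
    if($html === false)
    {
        return; // fetch failed, skip this page
    }
    // Extract every link on the page
    $dom = new DOMDocument;
    if(!@$dom->loadHTML($html))
    {
        return;
    }
    $links = array();
    foreach($dom->getElementsByTagName('a') as $a)
    {
        $href = $a->getAttribute('href');
        if($href !== '')
        {
            $links[] = $href;
            echo $href, "<br>";
        }
    }
    // Follow each extracted link one level deeper
    // (relative hrefs would need to be resolved against $url first)
    foreach($links as $link)
    {
        crawl($link, $depth + 1, $max_depth);
    }
}

crawl('http://example.com/', 0, 2); // placeholder start URL, max depth of 2
?>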
1st Attempt:
<?php
include('simple_html_dom.php');
$current_link_crawling_level = 0;
$link_crawling_level_max = 2;
if($current_link_crawling_level == $link_crawling_level_max)
{
    echo "link crawling depth level reached!";
    sleep(5);
    exit();
}
else
{
    $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref';
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    $response_string = curl_exec($curl);
    $html = str_get_html($response_string);
    $current_link_crawling_level++;
    //to fetch all hyperlinks from the webpage
    $links = array();
    foreach($html->find('a') as $a)
    {
        $links[] = $a->href;
        echo "Value: $a<br />\n";
        print_r($links);
        sleep(1);
        $url = '$value';
        $curl = curl_init($a);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
        $response_string = curl_exec($curl);
        $html = str_get_html($response_string);
        $current_link_crawling_level++;
        //to fetch all hyperlinks from the webpage
        $links = array();
        foreach($html->find('a') as $a)
        {
            $links[] = $a->href;
            echo "Value: $a<br />\n";
            print_r($links);
            sleep(1);
        }
        echo "Value: $a<br />\n";
        print_r($links);
    }
}
?>
2nd Attempt:
<?php
include('simple_html_dom.php');
$current_link_crawling_level = 0;
$link_crawling_level_max = 2;
if($current_link_crawling_level == $link_crawling_level_max)
{
    echo "link crawling depth level reached!";
    sleep(5);
    exit();
}
else
{
    $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref';
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    $response_string = curl_exec($curl);
    $html = str_get_html($response_string);
    $current_link_crawling_level++;
    //to fetch all hyperlinks from the webpage
    // Hide HTML warnings
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    if($dom->loadHTML($html, LIBXML_NOWARNING))
    {
        // echo Links and their anchor text
        echo '<pre>';
        echo "Link\tAnchor\n";
        foreach($dom->getElementsByTagName('a') as $link)
        {
            $href = $link->getAttribute('href');
            $anchor = $link->nodeValue;
            echo $href,"\t",$anchor,"\n";
            sleep(1);
            $url = 'http://php.net/manual-lookup.php?pattern=str_get_html&scope=quickref';
            $curl = curl_init($url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
            curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
            $response_string = curl_exec($curl);
            $html = str_get_html($response_string);
            $current_link_crawling_level++;
            //to fetch all hyperlinks from the webpage
            // Hide HTML warnings
            libxml_use_internal_errors(true);
            $dom = new DOMDocument;
            if($dom->loadHTML($html, LIBXML_NOWARNING))
            {
                // echo Links and their anchor text
                echo '<pre>';
                echo "Link\tAnchor\n";
                foreach($dom->getElementsByTagName('a') as $link)
                {
                    $href = $link->getAttribute('href');
                    $anchor = $link->nodeValue;
                    echo $href,"\t",$anchor,"\n";
                    sleep(1);
                }
                echo '</pre>';
            }
            else
            {
                echo "Failed to load html.";
            }
        }
    }
    else
    {
        echo "Failed to load html.";
    }
}
?>
I would appreciate any code sample that is very simple, suitable for beginners. Better still if it is in procedural style, as I am a beginner.
Thank You!
Upvotes: 2
Views: 3505
Reputation: 12662
Let's use cURL instead of file_get_contents(), since it's a better option for handling HTTPS requests. Also, add warning suppression to avoid messages about broken HTML:
<?php
/*
Using PHP's DOM functions to
fetch hyperlinks and their anchor text
*/
$url = 'https://stackoverflow.com/questions/50381348/extract-urls-anchor-texts-from-links-on-a-webpage-fetched-by-php-or-curl';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$data = curl_exec($curl);
// Hide HTML warnings
libxml_use_internal_errors(true);
$dom = new DOMDocument;
if($dom->loadHTML($data, LIBXML_NOWARNING)){
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
$anchor = $link->nodeValue;
echo $href,"\t",$anchor,"\n";
}
echo '</pre>';
}else{
echo "Failed to load html.";
}
?>
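If you want the output in the blank-line-separated format you described (URL, then its anchor text, then an empty line), a small variation of the loop above should do it, for example:
// Replace the foreach in the snippet above with this
foreach($dom->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href'), "<br>";
    echo trim($link->nodeValue), "<br>";
    echo "<br>";
}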
Upvotes: 2