Reputation: 1094
I am trying to create a program that will open a text file with URLs separated by |. It will then take the first line of the text document, crawl that URL, and remove it from the text file. Each URL is to be scraped by a basic crawler. I know the crawler part works, because if I enter one of the URLs in quotations rather than a variable from the text file, it works. I am at the point where it will not return anything because the URL simply will not be accepted.
This is a basic version of my code, because I had to break it down a lot to isolate the problem.
<?php
// Read the pipe-separated list of URLs and take the first one.
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];

// Load the page and collect every anchor tag.
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');

foreach ($anchors as $element) {
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if ($class == 'result_link') {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
?>
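For the remove-it-from-the-file step described above (not part of the code I isolated here), this is a minimal sketch of what I have in mind, assuming urls.txt is writable:
<?php
// Take the first URL off the list and write the remaining ones back to urls.txt.
$urlarray = explode("|", file_get_contents('urls.txt'));
$url = trim(array_shift($urlarray));                     // URL to crawl next
file_put_contents('urls.txt', implode("|", $urlarray));  // remaining URLs
?>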
Upvotes: 0
Views: 1050
Reputation: 5701
The code below works like a champ, tested with your example data:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];

// Fetch the page with cURL, sending a bot user agent and following redirects.
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);

// Parse the returned HTML and print the titles of the matching anchors.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if ($class == 'result_link') {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
?>
Almost forgot: let's now put it in a loop to run through all the URLs:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

foreach ($urlarray as $url) {
    if (!empty($url)) {
        // Fetch each page with cURL; trim() guards against stray whitespace in the list.
        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL, trim($url));
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);

        // Parse the HTML and print the titles of the matching anchors.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $title = $element->getAttribute('title');
            $class = $element->getAttribute('class');
            if ($class == 'result_link') {
                $title = str_replace('Synonyms of ', '', $title);
                echo $title . "<br />";
            }
        }
        echo '<hr />';
    }
}
?>
Upvotes: 1
Reputation: 5701
So if you put in a URL manually, $url = 'http://www.mywebsite.com';, everything works as expected?
If so, the problem is here: $urlarray = explode("|", $contents = file_get_contents('urls.txt'));
Are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc.?
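For example, urls.txt would hold everything on a single line, something like this (placeholder URLs):
http://a.com|http://b.com|http://c.com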
I would var_dump $contents = file_get_contents('urls.txt') before the explode statement to see if it is loading at all.
If yes, I would then explode the contents into $urlarray and var_dump $urlarray[0].
If that looks right, I would trim it before sending it to DOM with trim($urlarray[0]).
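A minimal sketch of those checks, reusing the file name from the question:
<?php
// Step-by-step checks before handing anything to DOMDocument.
$contents = file_get_contents('urls.txt');
var_dump($contents);                  // is the file loading at all?

$urlarray = explode("|", $contents);
var_dump($urlarray[0]);               // does the first entry look like a URL?

$url = trim($urlarray[0]);            // strip stray whitespace/newlines before crawling
?>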
I may even go as far as using a regex to make sure these URLs are in fact URLs before sending them to DOM.
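As a sketch of that check, PHP's built-in filter_var() with FILTER_VALIDATE_URL could stand in for a hand-written regex:
<?php
// Validate each candidate before crawling; filter_var() is used here instead of a regex.
foreach (explode("|", file_get_contents('urls.txt')) as $url) {
    $url = trim($url);
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        echo "Skipping invalid entry: " . htmlspecialchars($url) . "<br />";
        continue;
    }
    // ...crawl $url here...
}
?>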
Let me know the results and I will try to help further, or post all your sample code, including urls.txt, and I will run it locally.
Upvotes: 0