bs7280

Reputation: 1094

php dom not accepting url

I am trying to create a program that opens a text file containing URLs separated by |. It takes the first URL in the file, crawls it, and removes it from the file. Each URL is scraped by a basic crawler. I know the crawler part works, because if I enter one of the URLs as a quoted string rather than as a variable read from the text file, it works fine. At this point the script returns nothing, because the URL is simply not accepted.

This is a stripped-down version of my code; I had to simplify it a lot to isolate the problem.

$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}

Upvotes: 0

Views: 1050

Answers (2)

Tim Wickstrom

Reputation: 5701

The code below works like a champ; tested with your example data:

<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

$url = $urlarray[0];

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);

$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
?>

Almost forgot: let's now put it in a loop to iterate over all the URLs:

<?php
    $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

    foreach($urlarray as $url) {
        if(!empty($url)) {
            $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

            $ch = curl_init();
            curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
            curl_setopt($ch, CURLOPT_URL,trim($url));
            curl_setopt($ch, CURLOPT_FAILONERROR, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_AUTOREFERER, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);
            $html = curl_exec($ch);

            $dom = new DOMDocument();
            @$dom->loadHTML($html);

            $anchors = $dom->getElementsByTagName('a');
            foreach($anchors as $element)
            {
                $title = $element->getAttribute('title');
                $class = $element->getAttribute('class');
                if($class == 'result_link')
                {
                    $title = str_replace('Synonyms of ', '', $title);
                    echo $title . "<br />";
                }
            }
            echo '<hr />';
        }
    }
?>
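One thing the loop above does not do is remove each crawled URL from urls.txt, which the original question asked for. A minimal sketch of that step, assuming the pipe-separated format shown (the helper name popFirstUrl is mine, not from the post):

```php
<?php
// Sketch: pop the first URL off a pipe-separated list file and
// write the remainder back, so each crawled URL is removed.
function popFirstUrl(string $path): ?string
{
    $urls = array_filter(array_map('trim', explode('|', file_get_contents($path))));
    $next = array_shift($urls);                    // URL to crawl now (null if list is empty)
    file_put_contents($path, implode('|', $urls)); // persist what is left
    return $next;
}

// Demo against a temporary file rather than the real urls.txt:
$path = tempnam(sys_get_temp_dir(), 'urls');
file_put_contents($path, "http://a.com|http://b.com");
echo popFirstUrl($path) . "\n";        // http://a.com
echo file_get_contents($path) . "\n";  // http://b.com
```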

Upvotes: 1

Tim Wickstrom

Reputation: 5701

So if you put in a URL manually, e.g. $url = 'http://www.mywebsite.com';, everything works as expected?

If so, there is a problem here: $urlarray = explode("|", $contents = file_get_contents('urls.txt'));

Are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc.?

I would var_dump($contents = file_get_contents('urls.txt')) before the explode statement to see if the file is loading at all.

If it is, I would then explode it into $urlarray and var_dump($urlarray[0]).

If that looks right, I would trim it before sending it to DOM, with trim($urlarray[0]).

I may even go as far as validating with a regex to make sure these URLs are in fact URLs before sending them to DOM.

Let me know the results and I will try to help further, or post all your sample code including urls.txt and I will run it locally.
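To illustrate the trim-and-validate idea above: PHP's built-in filter_var() with FILTER_VALIDATE_URL can stand in for a hand-written regex, and the trim matters because an invisible trailing newline is the usual reason a URL read from a file fails while the same URL typed as a literal works. (The sample values below are mine, for illustration.)

```php
<?php
// Sketch: reject malformed entries before they ever reach DOMDocument.
$candidates = ["http://www.example.com\n", "not a url", " http://example.org "];

foreach ($candidates as $raw) {
    $url = trim($raw);                                  // strip stray \n, \r, spaces
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
        echo "crawl: $url\n";
    } else {
        echo "skip:  $url\n";
    }
}
// crawl: http://www.example.com
// skip:  not a url
// crawl: http://example.org
```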

Upvotes: 0
