Sufiyan Ghori

Reputation: 18743

fetching content from a webpage using curl

First of all, have a look here:

www.zedge.net/txts/4519/

This page contains many text messages. I want my script to open each message page and download its content, but I am running into a problem.

This is my simple script to open the page:

<?php
 // CURLOPT_RETURNTRANSFER must be set *before* curl_exec,
 // otherwise curl prints the page and $contents is just true/false.
 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 $contents = curl_exec($ch);
 curl_close($ch);
?>

The page downloads fine, but how do I open every text-message page inside it one by one and save each one's content to a text file? I know how to save the content of a single web page to a text file using cURL, but in this case there are many different pages linked from the page I've downloaded. How do I open them one by one separately?

I have this idea, but I don't know if it will work:

Download this page:

www.zedge.net/txts/4519

Find all the links to the text-message pages inside it and save each link to a text file (one per line). Then run another cURL session, read the links from the file one by one, open each page, copy the content of the relevant DIV, and save it to a new file.
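The two-phase idea above can be sketched like this (a minimal sketch; the inline HTML, the `msg` class name, and the file names are placeholders — inspect the real page to find the right selector, and the network fetch in phase 2 is left commented out):

```php
<?php
// Phase 1: extract all message links from the downloaded page and
// write them to links.txt, one per line. The sample HTML below is a
// stand-in for the real page; adjust the XPath to match its markup.
$html = '<html><body>
    <a class="msg" href="/txts/4519/msg/1">First</a>
    <a class="msg" href="/txts/4519/msg/2">Second</a>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);            // @ silences warnings on sloppy HTML
$xpath = new DOMXPath($dom);

$links = array();
foreach ($xpath->query('//a[@class="msg"]/@href') as $href) {
    $links[] = 'http://www.zedge.net' . $href->nodeValue;
}
file_put_contents('links.txt', implode("\n", $links));

// Phase 2: read the file back and fetch each page (the actual fetch
// is commented out so the sketch runs without network access).
foreach (file('links.txt', FILE_IGNORE_NEW_LINES) as $url) {
    // $ch = curl_init($url);
    // curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // $page = curl_exec($ch);
    // curl_close($ch);
    // ... save $page to its own text file ...
}
?>
```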

Upvotes: 1

Views: 5659

Answers (2)

Jay Katira

Reputation: 181

I used DOM for my code. I loaded the desired page and filtered its data using getElementsByTagName('td'). In my case I wanted the status of the relays from a device page, refreshed each time, and used the code below for that.

$keywords = array();
$domain = array('http://USERNAME:PASSWORD@URL/index.htm');
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
foreach ($domain as $key => $value) {
    // @ suppresses warnings from malformed HTML
    @$doc->loadHTMLFile($value);
    // Grab every <td> cell; switch the tag name to 'table' or 'tr'
    // if you need coarser chunks.
    $anchor_tags = $doc->getElementsByTagName('td');
    foreach ($anchor_tags as $tag) {
        $keywords[] = strtolower($tag->nodeValue);
    }
}

Then I get the desired relay names and statuses in the $keywords[] array; here is a screenshot of the output.

If you want to read all the messages on the main page, first collect all the links to the separate message pages, then apply the same process to each of them.

Upvotes: 2

Vyktor

Reputation: 20997

The algorithm is pretty straightforward:

  • download www.zedge.net/txts/4519 with curl
  • parse it with DOM (or an alternative) for links
  • either store them all in a text file/database or process them on the fly with a "subrequest"


// Load main page
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($contents);

// Filter all the links (note the @ before the attribute name
// and the quotes around the value)
$xPath = new DOMXPath($dom);
$items = $xPath->query('//a[@class="myLink"]');

foreach ($items as $link) {
    $url = $link->getAttribute('href');
    if (strncmp($url, 'http', 4) != 0) {
        // Prepend http://www.zedge.net or similar
    }

    // Open sub request for the link we just extracted
    curl_setopt($ch, CURLOPT_URL, $url);
    $subContent = curl_exec($ch);
}

See the documentation and examples for DOMXPath::query, and note that DOMNodeList implements Traversable, therefore you can use foreach on it.

Tips:

  • Use the curl options CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE
  • Use sleep(...) so you don't flood the server
  • Raise PHP's time and memory limits (set_time_limit(), ini_set('memory_limit', ...))
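The tips above might look like this in code (a sketch only; the file names, limits, and one-second delay are arbitrary choices, and $urls stands in for the links collected earlier):

```php
<?php
set_time_limit(0);                       // no script time limit
ini_set('memory_limit', '256M');         // room for many pages

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Persist cookies across all the sub-requests in one session file
curl_setopt($ch, CURLOPT_COOKIEJAR,  'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');

$urls = array();                         // fill with the links collected earlier
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $page = curl_exec($ch);
    // ... save $page to a file ...
    sleep(1);                            // be polite to the server
}
curl_close($ch);
?>
```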

Upvotes: 3
