Sufiyan Ghori

Reputation: 18743

fetching content from a webpage using curl

First of all, have a look here:

www.zedge.net/txts/4519/

This page contains many text messages. I want my script to open each message page and download its content, but I am running into a problem.

This is my simple script to open the page:

<?php
 // CURLOPT_RETURNTRANSFER must be set *before* curl_exec,
 // otherwise curl prints the page and $contents is just true/false.
 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 $contents = curl_exec($ch);
 curl_close($ch);
?>

The page downloads fine, but how do I open every text-message page inside it one by one and save each one's content to a text file? I know how to save the content of a single web page to a text file using cURL, but in this case there are many different pages linked from the page I've downloaded. How do I open them one by one separately?

I have this idea, but I don't know if it will work:

Download this page:

www.zedge.net/txts/4519

Find all the links to the text-message pages inside it and save each link to a text file (one per line). Then run another cURL session, read the links from the file one by one, open each page, copy the content of the relevant DIV, and save it to a new file.
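The two-phase idea above can be sketched like this (a minimal sketch; the inline HTML, the `msg` class name, and the file names are placeholders — inspect the real page to find the right selector, and the network fetch in phase 2 is left commented out):

```php
<?php
// Phase 1: extract all message links from the downloaded page and
// write them to links.txt, one per line. The sample HTML below is a
// stand-in for the real page; adjust the XPath to match its markup.
$html = '<html><body>
    <a class="msg" href="/txts/4519/msg/1">First</a>
    <a class="msg" href="/txts/4519/msg/2">Second</a>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);            // @ silences warnings on sloppy HTML
$xpath = new DOMXPath($dom);

$links = array();
foreach ($xpath->query('//a[@class="msg"]/@href') as $href) {
    $links[] = 'http://www.zedge.net' . $href->nodeValue;
}
file_put_contents('links.txt', implode("\n", $links));

// Phase 2: read the file back and fetch each page (the actual fetch
// is commented out so the sketch runs without network access).
foreach (file('links.txt', FILE_IGNORE_NEW_LINES) as $url) {
    // $ch = curl_init($url);
    // curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // $page = curl_exec($ch);
    // curl_close($ch);
    // ... save $page to its own text file ...
}
?>
```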

Upvotes: 1

Views: 5659

Answers (2)

Jay Katira

Reputation: 181

I used DOM for my code. I loaded the desired page and filtered its data using getElementsByTagName('td'). In my case I wanted the status of the relays from a device page, refreshed each time, and used the code below for that.

$keywords = array();
$domain = array('http://USERNAME:PASSWORD@URL/index.htm');
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
foreach ($domain as $key => $value) {
    // @ suppresses warnings from malformed HTML
    @$doc->loadHTMLFile($value);
    // Grab every <td> cell; switch the tag name to 'table' or 'tr'
    // if you need coarser chunks.
    $anchor_tags = $doc->getElementsByTagName('td');
    foreach ($anchor_tags as $tag) {
        $keywords[] = strtolower($tag->nodeValue);
    }
}

Then I get the desired relay names and statuses in the $keywords[] array; here is a screenshot of the output.

If you want to read all the messages on the main page, first collect all the links to the separate message pages, then apply the same process to each of them.

Upvotes: 2

Vyktor

Reputation: 20997

The algorithm is pretty straightforward:

  • download www.zedge.net/txts/4519 with curl
  • parse it with DOM (or an alternative) for links
  • either store them all in a text file/database or process them on the fly with a "subrequest"


// Load main page
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($contents);

// Filter all the links (note the @ before the attribute name
// and the quotes around the value)
$xPath = new DOMXPath($dom);
$items = $xPath->query('//a[@class="myLink"]');

foreach ($items as $link) {
    $url = $link->getAttribute('href');
    if (strncmp($url, 'http', 4) != 0) {
        // Prepend http://www.zedge.net or similar
    }

    // Open sub request for the link we just extracted
    curl_setopt($ch, CURLOPT_URL, $url);
    $subContent = curl_exec($ch);
}

See the documentation and examples for DOMXPath::query, and note that DOMNodeList implements Traversable, therefore you can use foreach on it.

Tips:

  • Use the curl options CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE
  • Use sleep(...) so you don't flood the server
  • Raise PHP's time and memory limits (set_time_limit(), ini_set('memory_limit', ...))
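The tips above might look like this in code (a sketch only; the file names, limits, and one-second delay are arbitrary choices, and $urls stands in for the links collected earlier):

```php
<?php
set_time_limit(0);                       // no script time limit
ini_set('memory_limit', '256M');         // room for many pages

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Persist cookies across all the sub-requests in one session file
curl_setopt($ch, CURLOPT_COOKIEJAR,  'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');

$urls = array();                         // fill with the links collected earlier
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $page = curl_exec($ch);
    // ... save $page to a file ...
    sleep(1);                            // be polite to the server
}
curl_close($ch);
?>
```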

Upvotes: 3
