Marcatectura

Reputation: 1695

cURL Multiple URLs & parse result

I'm working on a PHP scraper to do the following:

  1. cURL several (always fewer than 10) URLs,

  2. Add the HTML from each URL to a DOMDocument,

  3. Parse that DOMDocument for <a> elements which link to PDFs,

  4. Store the hrefs for matching elements in an array.

I have steps 1 & 2 down (my code outputs the combined HTML for all URLs), but when I try to iterate through the result to find <a> elements linking to PDFs, I get nothing (an empty array).

I've tried my parser code on a single cURL'd page and it works (returns an array with the URLs for each PDF on that page).

Here's my cURL code:

$urls = Array( 
 'http://www.example.com/about/1.htm', 
 'http://www.example.com/about/2.htm',
 'http://www.example.com/about/3.htm',
 'http://www.example.com/about/4.htm' 
); 

# Make DOMDoc
$dom = new DOMDocument();

foreach ($urls as $url) { 
    $ch = curl_init($url);  
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
} 

And the parser code:

#make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning suppresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}

The full code looks like this:

<?php 

$urls = Array( 
 'http://www.example.com/about/1.htm', 
 'http://www.example.com/about/2.htm',
 'http://www.example.com/about/3.htm',
 'http://www.example.com/about/4.htm' 
); 

# Make DOM parser
$dom = new DOMDocument();

foreach ($urls as $url) { 
    $ch = curl_init($url);  
    $html = curl_exec($ch);
    # Exec and close CURL, suppressing errors
    @$dom->createDocumentFragment($html);
    curl_close($ch);
} 

#make pdf link array
$pdf_array = array();
# Iterate over all <a> tags and spit out those that end with ".pdf"
foreach($dom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    $linkh = $link->getAttribute('href');
    $filend = ".pdf";
    # @ at beginning suppresses string length warning
    @$pdftester = substr_compare($linkh, $filend, -4, 4, true);
    if ($pdftester === 0) {
        array_push($pdf_array, $linkh);
    }
}

print_r($pdf_array);

?> 

Any suggestions for what I'm doing wrong on the DOM parsing and PDF array building?

Upvotes: 1

Views: 2048

Answers (1)

mhall

Reputation: 3701

1. In order to get the HTML contents into $html, you need to set the cURL option CURLOPT_RETURNTRANSFER. Otherwise curl_exec() will just print the contents straight to the page and put 1 (success) in $html.

CURLOPT_RETURNTRANSFER: TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it directly.

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);

2. The createDocumentFragment method does not do what you think it does.

This function creates a new instance of class DOMDocumentFragment. This node will not show up in the document unless it is inserted with (e.g.) DOMNode::appendChild().

So it does not read the HTML into the DOM document. It does not even take an $html parameter.
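Just to illustrate the difference (this is a sketch, not a fix for your scraper): the fragment API is meant to be used along these lines, where you create an empty fragment, fill it yourself, and then attach it with appendChild():

$dom = new DOMDocument();
$dom->loadHTML('<html><body></body></html>');

# createDocumentFragment() takes no arguments and returns an empty fragment
$frag = $dom->createDocumentFragment();
$frag->appendXML('<p>Hello</p>'); # appendXML() expects well-formed markup

# Nothing shows up in the document until the fragment is appended
$dom->getElementsByTagName('body')->item(0)->appendChild($frag);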

You would probably be better off using the loadHTML method, or loadHTMLFile if you want to skip cURL and load the file directly into the DOM object in one go.

@$dom->loadHTML($html);    // Like this
@$dom->loadHTMLFile($url); // or this (removing the CURL lines)

3. It would make sense to extract the PDF links immediately after you have loaded the HTML into the DOM object, instead of trying to combine all pages into one before extracting. The code you have for that is actually working quite well :-)
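For what it's worth, a minimal sketch of that per-page approach (using your example URLs, CURLOPT_RETURNTRANSFER, loadHTML, and your own ".pdf" test) could look like this:

<?php

$urls = array(
    'http://www.example.com/about/1.htm',
    'http://www.example.com/about/2.htm',
    'http://www.example.com/about/3.htm',
    'http://www.example.com/about/4.htm'
);

$pdf_array = array();

foreach ($urls as $url) {
    # Fetch the page, returning the HTML as a string instead of printing it
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);

    # Load this page's HTML into its own DOMDocument
    # (@ suppresses warnings about malformed markup)
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    # Collect hrefs that end with ".pdf" (same test as in your code)
    foreach ($dom->getElementsByTagName('a') as $link) {
        $linkh = $link->getAttribute('href');
        if (@substr_compare($linkh, '.pdf', -4, 4, true) === 0) {
            $pdf_array[] = $linkh;
        }
    }
}

print_r($pdf_array);

?>

If you go with loadHTMLFile($url) instead, the four cURL lines at the top of the loop collapse into a single call.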

Upvotes: 1
