Reputation: 3952

extract value from web page

Hi I have a website's home page that I am reading in using Curl and I need to grab the number of pages that the site has.

The information is in a div:-

<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
<a href="/users?page=5" title="go to page 5"><span class="page-numbers">5</span></a>
<span class="page-numbers dots">&hellip;</span>

<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers next"> next</span></a>
</div>

The value I need is 15 but this could be any number depending on the site but will always be in the same position.

How could I read this value easily and assign it to a variable in PHP.

Thanks

Jonathan

Upvotes: 0

Answers (6)

Jonathan Lyon

Reputation: 3952

Just wanted to say a huge thank you to Volkerk for helping out - it worked really well. I had to make a few slight changes and ended up with this:-

function getusers($userurl)
{
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);

$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {

  $lastpage = $nodelist->item(0)->nodeValue;
  $users = $lastpage * 35;
  $userurl = $userurl.'?page='.$lastpage;

  $sSourceData = file_get_contents($userurl);

$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
$users = $users + $nodelist->length;
echo 'there are ', $users , ' users';

}
else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
echo 'there are ', $nodelist->length, ' users';
}


}

Upvotes: 0

VolkerK

Reputation: 96159

You can use PHP's DOM module for that. Read the page with DOMDocument::loadhtmlfile(), then create a DOMXPath object and query all span elements within the document having the class="page-numbers" attribute.

(edit: oops, that's not what you're looking for, see second code snippet)

$html = '<html><head><title>:::</title></head><body>
<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
<a href="/users?page=5" title="go to page 5"><span class="page-numbers">5</span></a>
<span class="page-numbers dots">&hellip;</span>

<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers next"> next</span></a>
</div>
</body></html>';

$doc = new DOMDocument;
// since the content "is already here" we use loadhtml(content)
// instead of loadhtmlfile(url) 
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//span[@class="page-numbers"]');
echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';

edit: does this

<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>

(the second last a element) always point to the last page, i.e. does this link contain the value you're looking for?
Then you can use a XPath expression that selects the second but last a element and from there its child span element.

//div[@class="pager"] <- select each <div> where the attribute class equals "pager"
//div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
//div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second but last
//div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second but last <a> element in the pager <div>

( you might want to fetch a good XPath tutorial ;-) )

$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
  echo $nodelist->item(0)->nodeValue;
}
else {
  echo 'not found';
}

Upvotes: 2

Brienne Schroth

Reputation: 2457

perhaps

$nodes = $dom->getElementsByTagName("span");
$maxPageNum = 0;
foreach($nodes as $node)
{
    if( $node.class == "page-numbers" && $node.value > $maxPageNum )
    {
        $maxPageNum = $node.value;
    }
}

I don't know PHP, so maybe it's not that easy to access the class/inner text of a dom node, but there must be some way to get that info and the pseudocode here should work.

Upvotes: 0

pivotal

Reputation: 746

This is something you would might want to use a xpath for - which requires loading the page as a dom document object:

$domDoc = new DOMDocument();
$domDoc->loadHTMLFile("http://path/to/yourfile.html");
$xp = new DOMXPath($domDoc);
$nodes = $xp->query("//xpath/to/relevant/node");
$value = $nodes[0];

I haven't written a good xpath in a while, so you should do some reading to figure out that part, but it shouldn't be too difficult.

Upvotes: 0

Ivan Nevostruev

Reputation: 28713

You can parse it with regular expression. First find all occurense of <span class="page-numbers">, then select the last one:

// div html code should be in $div_html
preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers);
print_r(end($page_numbers[1])); // prints 15

Upvotes: 0

Elzo Valugi

Reputation: 27856

There is no direct function or easy way to do that. You need to build or use an existing HTML parser to do that.

Upvotes: 0

extract value from web page

Answers (6)

Related Questions