Milfzilla

Reputation: 11

XPath not working as expected [php]

I often use XPath with PHP for parsing pages, but this time I don't understand the behavior I get on one specific page with the code below. I hope you can help me with this.

This is the code I use to parse the page http://www.jeuxvideo.com/recherche.php?m=9&t=10&q=Call+of+duty :

<?php
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;

$ch = curl_init();     
curl_setopt($ch, CURLOPT_URL, $Query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
$response = curl_exec($ch);
curl_close($ch);

/*
$search = array("<article", "</article>");
$replace = array("<div", "</div>");
$response = str_replace($search, $replace, $response);
*/

$dom = new DOMDocument();
@$dom->loadHTML($response);

$xpath = new DOMXPath($dom);

$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/a');

//$elements = $xpath->query('//div[@class="recherche-aphabetique-item"]/a');

count($elements);

var_dump($elements);
?>

Fiddle to test it: http://phpfiddle.org/main/code/r9n6-d0j0

I just want to get all "a" nodes that are in "article" nodes with the class "recherche-aphabetique-item".

But it returns nothing :/.

As you can see in the commented-out code, I tried replacing the HTML5 article elements with div elements, but I got the same behavior.

Thanks for your help.

Upvotes: 1

Views: 695

Answers (1)

Professor Abronsius

Reputation: 33813

I'm seeing lots of "DOMDocument::loadHTML(): Unexpected end tag" errors. You should perhaps use libxml's internal error handling functions to deal with them. Also, when I looked at the DOM of the remote site I could not see any a tags that would match the XPath query, only span tags:

<?php
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;

$ch = curl_init();     
curl_setopt($ch, CURLOPT_URL, $Query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
$response = curl_exec($ch);
curl_close($ch);

/* try to suppress errors using libxml */
libxml_use_internal_errors( true );

$dom = new DOMDocument();

/* additional flags for DOMDocument */
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;

@$dom->loadHTML($response);

libxml_clear_errors();

$xpath = new DOMXPath($dom);

$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');

count( $elements );
var_dump( $elements );
?>

Output:

object(DOMNodeList)#97 (1) { ["length"]=> int(94) } 
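
To illustrate why the original query came back empty, here is a minimal sketch run against a small inline fragment. The HTML below is hypothetical and only mimics the shape I saw (span children inside the article elements); it is not the real markup of the remote page. The step /a only selects a elements that are direct children of the article, so it finds nothing when the children are spans.

```php
<?php
// Hypothetical fragment for illustration, NOT the real markup of the remote page.
$html = '<html><body>'
      . '<article class="recherche-aphabetique-item"><span>Call of Duty</span></article>'
      . '<article class="recherche-aphabetique-item"><span>Call of Duty 2</span></article>'
      . '</body></html>';

// Some libxml2 versions report HTML5 tags such as <article> as invalid,
// so collect those messages internally instead of emitting warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// "/a" selects only <a> elements that are direct children of the article.
$links = $xpath->query('//article[@class="recherche-aphabetique-item"]/a');

// "/span" selects the direct <span> children that actually exist in this fragment.
$spans = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');

echo $links->length, PHP_EOL; // 0
echo $spans->length, PHP_EOL; // 2
```

If the links you want are nested deeper inside the article, a descendant step such as //article[...]//a would be the thing to try, but whether that matches anything depends on the actual page.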

You could perhaps simplify the script further by dropping cURL and letting DOMDocument fetch the page itself with loadHTMLFile:

$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;

libxml_use_internal_errors( true );
$dom = new DOMDocument();
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
@$dom->loadHTMLFile($Query);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');
count($elements);
foreach( $elements as $node )echo $node->nodeValue,'<br />';
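
If you would rather look at what the parser complained about than discard it, libxml_get_errors() returns the collected messages before you clear them. This is a rough sketch, assuming a deliberately malformed inline string (hypothetical, not taken from the remote site); which messages you get, if any, can vary with the libxml2 version PHP was built against.

```php
<?php
// Deliberately malformed, hypothetical HTML: note the stray </div> end tag.
$html = '<html><body><p>Hello</p></div></body></html>';

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch the collected LibXMLError objects, then empty the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Print each message with the line number libxml reported for it.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

With internal error handling switched on like this, the @ in front of loadHTML should no longer be needed to keep the warnings quiet.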

Upvotes: 1
