Reputation: 2291
I am trying to extract the text inside an HTML tag, but I cannot get it to work:
HTML - Text to be extracted (http://www.alexa.com/siteinfo/google.com)
<span class="font-4 box1-r">3,757,209</span>
PHP
$data = frontend::file_get_contents_curl('http://www.alexa.com/siteinfo/'.$domain); // custom function that returns the HTML string
$dom = new DOMDocument();
$dom->loadHTML(htmlentities($data));
$xpath = new DOMXpath($dom);
$backlinks = $xpath->query('//span[@class="font-4 box1-r"]/text()');
var_dump($backlinks); // returns null
Upvotes: 0
Views: 443
Reputation: 89285
The actual problem is due to htmlentities() escaping all tag delimiters (< and >), so you end up loading a long string with no elements and attributes to DOMDocument():
$data = <<<HTML
<div><span class="font-4 box1-r">3,757,209</span></div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML(htmlentities($data));
echo $doc->saveXML();
eval.in demo (problem)
eval.in demo (solution)
Output:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&lt;div&gt;&lt;span class="font-4 box1-r"&gt;3,757,209&lt;/span&gt;&lt;/div&gt;</p></body></html>
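The fix is to pass the raw HTML to loadHTML() without htmlentities(). Here is a minimal sketch, assuming the fetched page still contains the span shown in the question. The sample string stands in for the real response, and the libxml_use_internal_errors() call is an addition to silence warnings on messy real-world markup; neither is part of the original code.

```php
<?php
// Sample markup standing in for the fetched page (assumed structure).
$data = <<<HTML
<div><span class="font-4 box1-r">3,757,209</span></div>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($data); // raw HTML, no htmlentities()
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//span[@class="font-4 box1-r"]/text()');

// query() returns a DOMNodeList; take the first text node if there is one.
$backlinks = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $backlinks, PHP_EOL;
```

With the tags left intact, the XPath expression matches the span and $backlinks holds its text.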
Upvotes: 2
Reputation: 121
You can use the simplehtmldom library (http://simplehtmldom.sourceforge.net/) for this purpose, for example:
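require_once 'simplehtmldom/simple_html_dom.php';
// Fetch and parse the page, then take the first <span> with class "box1-r"
$html = file_get_html('http://www.alexa.com/siteinfo/google.com');
$span = $html->find('span.box1-r', 0); // null when nothing matches
echo $span ? $span->plaintext : 'not found';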
Upvotes: 1