Reputation: 4034
I want to get the text that sits directly inside the main div, without any of its child tags. For example, I want to scrape "Winter Skate brought to you by Harvard Pilgrim HealthCare, offering day and evening public skating, is the perfect remedy to cabin fever this winter." from the HTML below. I'm using XPath with Simple HTML DOM, and here is my code:
foreach($dom->find('//*[@id="main"]/text()[1]') as $element){
$details=$element;
}
But it neither finds any element nor enters the foreach loop. Can you suggest a solution? This is the HTML:
<div id="main">
<div>a</div>
<div>b</div>
<div>c</div>
<a name="abc"></a>Winter Skate brought to you by Harvard Pilgrim HealthCare, offering day and evening public skating, is the perfect remedy to cabin fever this winter.<br />
<br />
A fun and affordable activity for parents with children, Winter Skate is also an ideal lunch break getaway and a romantic addition to a dinner date at Patriot Place. <br />
<br />
The 60-by-140-foot, refrigerated ice surface is designed specifically for recreational skating and the professional surface is large enough to accommodate beginners and experts alike.<br />
<br />
On-site skate rentals, concessions and bathrooms are available and parking is free.<br />
<br />
<br />
<b>Concessions</b><br />
Dunkin Donuts will be on-site with coffee, hot chocolate and donuts available for purchase. Additionally, Patriot Place features 16 dining and quick service restaurants including: Bar Louie, Baskin Robbins, Blue Fin Lounge, CBS Scene, Davio’s, Five Guys Burgers, Godiva, Olive Garden, Qdoba, Red Robin, Skipjack’s, Studio 3, Tastings Wine Bar & Bistro, Tavolino Pizza Gourmet, Twenty8 Food & Spirits.<br />
<br />
NOTE: Hours may occasionally vary due to inclement weather, Patriots home games, or pre-scheduled private events – please check back or call 508-203-2100<br><br>
<a name='hours' class='ranchor'></a>
</div>
Upvotes: 1
Views: 732
Reputation: 19482
SimpleHtmlDom does not implement the official W3C DOM API. It uses CSS selectors, not XPath. CSS selectors cannot select text nodes; they only match element nodes.
You can use PHP's standard, native DOM extension instead:
$dom = new DOMDocument();
@$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
var_dump(
$xpath->evaluate('string(//*[@id="main"]/text()[normalize-space() != ""][1])')
);
Output:
string(149) "Winter Skate brought to you by Harvard Pilgrim HealthCare, offering day and evening public skating, is the perfect remedy to cabin fever this winter."
The condition [normalize-space() != ""] filters out text nodes that contain only whitespace.
string() casts the first node in the result list to a string, which avoids the need for a loop.
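If you need every paragraph of text rather than only the first one, the same expression can be run with query() and looped over. The following is a minimal sketch, assuming a shortened, made-up version of the markup from the question; the variable names $paragraphs and $textNode are illustrative only.

```php
<?php
// Hypothetical sample markup, shortened from the question's HTML.
$html = <<<'HTML'
<div id="main">
<div>a</div>
<a name="abc"></a>First paragraph.<br />
<br />
Second paragraph.<br />
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select all direct text node children of #main that are not whitespace-only.
$paragraphs = [];
foreach ($xpath->query('//*[@id="main"]/text()[normalize-space() != ""]') as $textNode) {
    $paragraphs[] = trim($textNode->nodeValue);
}

print_r($paragraphs);
```

Text inside the nested div elements is not included, because text() only matches text nodes that are direct children of the div with the id main.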
DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() try to repair invalid HTML source. For example, they add the html and body elements if they do not exist. This can change the HTML, so it is a good idea to save the HTML back to a string to see the new structure:
$html = <<<'HTML'
<div id="main" class="one" class="two">
<div>a</div>
<div>b</div>
<div>c</div>
<a name="abc"></a>Winter Skate brought to you by ...
HTML;
$dom = new DOMDocument();
@$dom->loadHtml($html);
echo $dom->saveHtml();
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div id="main" class="one">
<div>a</div>
<div>b</div>
<div>c</div>
<a name="abc"></a>Winter Skate brought to you by ...</div></body></html>
Additionally, the @ suppresses errors and warnings from the HTML parsing. This works most of the time, but a better way is to use the libxml functions and handle or log the errors:
$dom = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom->loadHtml($html);
var_dump(libxml_get_errors());
Output:
array(1) {
[0]=>
object(LibXMLError)#2 (6) {
["level"]=>
int(2)
["code"]=>
int(42)
["column"]=>
int(39)
["message"]=>
string(26) "Attribute class redefined
"
["file"]=>
string(0) ""
["line"]=>
int(1)
}
}
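Instead of dumping the error objects, you could turn them into log lines. This is only a sketch: the message format and the use of error_log() are my assumptions, so swap in whatever logging you already have. Which errors are reported depends on the libxml version.

```php
<?php
// Same invalid markup as above: the class attribute is defined twice.
$html = '<div id="main" class="one" class="two">text</div>';

$dom = new DOMDocument();
// Collect parser errors internally and remember the previous setting.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);

$messages = [];
foreach (libxml_get_errors() as $error) {
    // LibXMLError exposes level, code, column, message, file and line.
    $messages[] = sprintf(
        'line %d, column %d: %s',
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Empty the error buffer and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($messages as $message) {
    error_log($message);
}
```

Calling libxml_clear_errors() matters if you parse several documents in one script, otherwise the errors from earlier documents stay in the buffer.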
If it reports an empty source, check that DOMDocument::loadHTMLFile() can actually fetch the document. Try fetching it with file_get_contents() first.
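One way to make that check explicit is to fetch the source yourself and only then hand it to loadHTML(). The helper below is a hypothetical sketch (loadHtmlSource() is not part of PHP or any library); it returns null when the source cannot be read or is empty.

```php
<?php
/**
 * Hypothetical helper: fetch the source first, then parse it.
 * Returns null if the source cannot be read or contains only whitespace.
 */
function loadHtmlSource(string $source): ?DOMDocument
{
    // @ hides the warning here; the false return value is checked instead.
    $html = @file_get_contents($source);
    if ($html === false || trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}
```

If this returns null for a URL that opens fine in a browser, the problem is in fetching the document (for example the allow_url_fopen setting or a blocked request), not in the XPath expression.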
Upvotes: 1