Reputation: 465
I am doing some screen scraping at the moment using PHP and Simple HTML DOM. I am struggling to find any consistency in the target's markup. The divs are all named strangely. See the example:
<!-- Page START -->
<h2>Small houses</h2>
<p id="imPathTitle">Dolls Houses</p>
<div id="imPage">
<div id="imCel1_02">
<div id="imCel1_02_Cont">
<div id="imObj1_02">
<img src="images/daisylane.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel1_00">
<div id="imCel1_00_Cont">
<div id="imObj1_00">
<img src="images/1_h117.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_00">
<div id="imCel0_00_Cont">
<div id="imObj0_00">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel1_01">
<div id="imCel1_01_Cont">
<div id="imObj1_01">
<img src="images/2_h111.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_01">
<div id="imCel0_01_Cont">
<div id="imObj0_01">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">
<br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel0_02">
<div id="imCel0_02_Cont">
<div id="imObj0_02">
<p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
</div>
</div>
</div>
</div>
<!-- Page END -->
There are two products on this page, and the site seems to be using divs like table cells. Which elements can I target to obtain the "image", "title" and "description" of each product? I'm using this at the moment:
foreach($all_pages->find('img') as $src){
if (strpos($src->src,"http://letoyvan.com") === false) {
$src->src = "http://letoyvan.com/$src->src";
}
$product['image'][] = $src->src;
}
foreach($all_pages->find('p[class*=imAlign_left]') as $description){
$product['description'][] = $description->innertext;
}
foreach($all_pages->find('span[class*=fc3]') as $title){
$product['title'][] = $title->innertext;
}
Upvotes: 0
Views: 506
Reputation: 5905
Simple HTML DOM eats memory like nothing on earth; DOMDocument is much better. Here is an example:
$page = <<< HTML
<html>
<head>
<title>Test DOMDocument</title>
</head>
<body>
<!-- Page START -->
<h2>Small houses</h2>
<p id="imPathTitle">Dolls Houses</p>
<div id="imPage">
<div id="imCel1_02">
<div id="imCel1_02_Cont">
<div id="imObj1_02">
<img src="images/daisylane.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel1_00">
<div id="imCel1_00_Cont">
<div id="imObj1_00">
<img src="images/1_h117.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_00">
<div id="imCel0_00_Cont">
<div id="imObj0_00">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel1_01">
<div id="imCel1_01_Cont">
<div id="imObj1_01">
<img src="images/2_h111.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_01">
<div id="imCel0_01_Cont">
<div id="imObj0_01">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">
<br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel0_02">
<div id="imCel0_02_Cont">
<div id="imObj0_02">
<p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
</div>
</div>
</div>
</div>
<!-- Page END -->
</body>
</html>
HTML;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
// loadHTML() parses an HTML string; load() expects a path to an XML file.
$dom->loadHTML($page);
foreach($dom->getElementsByTagName('img') as $img)
{
// Prefix relative image paths with the site URL.
$src = $img->getAttribute('src');
if (strpos($src, "http://letoyvan.com") === false) {
$src = "http://letoyvan.com/" . $src;
}
$product['image'][] = $src;
}
foreach($dom->getElementsByTagName('p') as $para)
{
if ($para->hasAttributes())
{
if ($para->getAttribute('class') == "imAlign_left")
{
$product['description'][] = $para->nodeValue;
}
}
}
foreach($dom->getElementsByTagName('span') as $span)
{
if ($span->hasAttributes())
{
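// The class attribute holds several names (e.g. "ff2 fc3 fs12 fb "), so test for the fc3 token.
if (in_array('fc3', preg_split('/\s+/', trim($span->getAttribute('class')))))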
{
$product['title'][] = $span->nodeValue;
}
}
}
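An alternative to comparing the whole class attribute is DOMXPath, which can match a single class name inside a multi-valued attribute. What follows is a minimal sketch, assuming a trimmed-down copy of the markup from the question; the `$html` sample and the `$titles` and `$images` variable names are illustrative and not part of the original post.

```php
<?php
// Hypothetical, shortened sample of the markup from the question.
$html = <<<HTML
<div id="imPage">
<div id="imObj1_00"><img src="images/1_h117.jpg" alt="" title="" /></div>
<div id="imObj0_00"><p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc4 fs10 ">Pretty painted cottage</span></p></div>
</div>
HTML;

libxml_use_internal_errors(true); // silence warnings about imperfect markup
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so "fc3" is matched as a whole class name.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " fc3 ")]';
$titles = [];
foreach ($xpath->query($query) as $span) {
    $titles[] = trim($span->nodeValue);
}

// Select the src attribute nodes of every image directly.
$images = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $images[] = $attr->nodeValue;
}

print_r($titles);
print_r($images);
```

Note that on the full page the keyword block at the bottom also carries the `fc3` class, so it may need to be filtered out (for example by also requiring `fs12`, if that holds across the site).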
If you need the description to retain its HTML, you can use this function:
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
// Import each child (not the element itself) and append its markup to the result.
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML .= trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
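On PHP 5.3.6 and later, `DOMDocument::saveHTML()` accepts a node argument, so the same idea can be written without a temporary document. This is a sketch of that variant, not the function above; the `innerHtml` name and the sample paragraph are assumptions made for illustration.

```php
<?php
// Hypothetical helper: serialise the children of $element as an HTML string.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serialises just that node, including its own tags.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<p class="imAlign_left"><span class="fc3">H117: Daisy Cottage</span><br />3 years+</p>');

// Take the first paragraph and print its contents with the inner markup kept.
$para = $dom->getElementsByTagName('p')->item(0);
echo innerHtml($para);
```

The wrapping `<p>` tag is not part of the result, only the spans, line breaks and text inside it.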
Upvotes: 2