Charles Marsh

Reputation: 465

Simple HTML Dom PHP help required

I am doing some screen scraping at the moment using PHP and Simple HTML DOM. I am struggling to find any consistency in the target's markup; the divs all have strange ids. See the example:

<!-- Page START -->
<h2>Small houses</h2>
<p id="imPathTitle">Dolls Houses</p>
<div id="imPage">

<div id="imCel1_02">
<div id="imCel1_02_Cont">
    <div id="imObj1_02">
<img src="images/daisylane.jpg" alt="" title="" />
    </div>
</div>
</div>

<div id="imCel1_00">
<div id="imCel1_00_Cont">
    <div id="imObj1_00">
<img src="images/1_h117.jpg" alt="" title="" />
    </div>
</div>
</div>

<div id="imCel0_00">
<div id="imCel0_00_Cont">
    <div id="imObj0_00">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">

<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
<br /></span></p>
    </div>
</div>
</div>

<div id="imCel1_01">
<div id="imCel1_01_Cont">
    <div id="imObj1_01">

<img src="images/2_h111.jpg" alt="" title="" />
    </div>
</div>
</div>

<div id="imCel0_01">
<div id="imCel0_01_Cont">
    <div id="imObj0_01">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">

<br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
<br /></span></p>
    </div>
</div>
</div>

<div id="imCel0_02">
<div id="imCel0_02_Cont">
    <div id="imObj0_02">
<p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
    </div>

</div>
</div>

</div>
<!-- Page END -->

There are two products on this page, and they seem to be using divs like table cells. Which elements can I target to obtain the "image", "title" and "description"? I'm using this at the moment:

foreach($all_pages->find('img') as $src){

    if (strpos($src->src,"http://letoyvan.com") === false) {
        $src->src = "http://letoyvan.com/$src->src";
    }
       $product['image'][] = $src->src;
}

foreach($all_pages->find('p[class*=imAlign_left]') as $description){
       $product['description'][] =  $description->innertext;
}

foreach($all_pages->find('span[class*=fc3]') as $title){
       $product['title'][] =  $title->innertext;
}

Upvotes: 0

Views: 506

Answers (1)

Liam Bailey

Reputation: 5905

Simple HTML DOM eats memory like nothing on earth; DOMDocument is much better. Here is an example:

    $page = <<< HTML
    <html>
    <head>
    <title>Test DOMDocument</title>
    </head>
    <body>
    <!-- Page START -->
    <h2>Small houses</h2>
    <p id="imPathTitle">Dolls Houses</p>
    <div id="imPage">

    <div id="imCel1_02">
    <div id="imCel1_02_Cont">
        <div id="imObj1_02">
    <img src="images/daisylane.jpg" alt="" title="" />
        </div>
    </div>
    </div>

    <div id="imCel1_00">
    <div id="imCel1_00_Cont">
        <div id="imObj1_00">
    <img src="images/1_h117.jpg" alt="" title="" />
        </div>
    </div>
    </div>

    <div id="imCel0_00">
    <div id="imCel0_00_Cont">
        <div id="imObj0_00">
    <p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">

    <br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
    <br /></span></p>
        </div>
    </div>
    </div>

    <div id="imCel1_01">
    <div id="imCel1_01_Cont">
        <div id="imObj1_01">

    <img src="images/2_h111.jpg" alt="" title="" />
        </div>
    </div>
    </div>

    <div id="imCel0_01">
    <div id="imCel0_01_Cont">
        <div id="imObj0_01">
    <p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
    <br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">

    <br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
    <br /></span></p>
        </div>
    </div>
    </div>

    <div id="imCel0_02">
    <div id="imCel0_02_Cont">
        <div id="imObj0_02">
    <p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
        </div>

    </div>
    </div>

    </div>
    <!-- Page END -->
    </body>
    </html>
HTML;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page); // loadHTML() parses an HTML string; load() expects the path of an XML file
foreach($dom->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    // Prefix relative paths with the site root
    if (strpos($src,"http://letoyvan.com") === false) {
        $src = "http://letoyvan.com/" . $src;
    }
    $product['image'][] = $src;
}

foreach($dom->getElementsByTagName('p') as $para) 
{
    if ($para->hasAttributes()) 
    {
         if ($para->getAttribute('class') == "imAlign_left")
         {
             $product['description'][] =  $para->nodeValue;
         }
    }
}

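foreach($dom->getElementsByTagName('span') as $span) 
{
    // The class attribute holds several names (e.g. "ff2 fc3 fs12 fb "),
    // so look for fc3 as one of its space-separated tokens
    $classes = preg_split('/\s+/', trim($span->getAttribute('class')));
    if (in_array('fc3', $classes, true))
    {
        $product['title'][] =  $span->nodeValue;
    }
}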
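
The title spans in the question's markup carry several classes at once (`ff2 fc3 fs12 fb `), so selecting by one class means testing for a token inside the attribute rather than comparing the whole string. As a rough sketch, the same selection can be written with DOMXPath, which ships with PHP's DOM extension. The `classTokenQuery` helper and the shortened `$html` fragment are made up for illustration and are not part of the original page or answer.

```php
<?php
// Hypothetical helper: builds an XPath expression matching elements whose
// class attribute contains $token as a whole space-separated word.
function classTokenQuery(string $tag, string $token): string
{
    return "//{$tag}[contains(concat(' ', normalize-space(@class), ' '), ' {$token} ')]";
}

// Illustrative fragment modelled on the markup in the question.
$html = '<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span>'
      . '<span class="ff2 fc4 fs10 ">Pretty painted cottage</span></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the text of every span that has fc3 among its classes.
$titles = [];
foreach ($xpath->query(classTokenQuery('span', 'fc3')) as $span) {
    $titles[] = trim($span->textContent);
}

print_r($titles);
```

Padding the attribute with spaces before calling `contains()` keeps a token such as `fc3` from matching a longer class name like `fc30`.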

If you need the description to retain its HTML, you can use this function:

 function DOMinnerHTML($element) 
    { 
        $innerHTML = ""; 
        $children = $element->childNodes; 
        foreach ($children as $child) 
        { 
            // Copy each child into a scratch document and append its markup
            $tmp_dom = new DOMDocument(); 
            $tmp_dom->appendChild($tmp_dom->importNode($child, true)); 
            $innerHTML .= trim($tmp_dom->saveHTML()); 
        } 

        return $innerHTML;
    } 
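
On the original question of which elements to target: in the sample markup the ids appear to pair up, with `imObj1_NN` holding an image and `imObj0_NN` holding the text for the same item. The sketch below relies on that pattern to keep image, title and description together per product. It is only a guess from the one page shown: the id pairing, the use of the bold `fb` class to recognise a real title, and the trimmed `$html` fragment are all assumptions, and the site may break them elsewhere.

```php
<?php
// Illustrative fragment: cells following the imObj1_NN (image) /
// imObj0_NN (text) id pattern seen in the question's markup.
$html = <<<HTML
<div id="imPage">
<div id="imObj1_00"><img src="images/1_h117.jpg" alt="" /></div>
<div id="imObj0_00"><p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc4 fs10 ">Pretty painted cottage</span></p></div>
<div id="imObj1_02"><img src="images/daisylane.jpg" alt="" /></div>
<div id="imObj0_02"><p class="imAlign_left"><span class="ff2 fc3 fs10 ">le toy van, wooden toys</span></p></div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$products = [];
foreach ($xpath->query("//div[starts-with(@id, 'imObj0_')]") as $textCell) {
    // "imObj0_00" -> "00"
    $suffix = substr($textCell->getAttribute('id'), strlen('imObj0_'));

    // Assumption: only real product titles use the bold "fb" class token.
    $bold = $xpath->query(
        ".//span[contains(concat(' ', normalize-space(@class), ' '), ' fb ')]",
        $textCell
    );
    // Assumption: the matching image lives in the cell with the same suffix.
    $img = $xpath->query("//div[@id='imObj1_{$suffix}']//img")->item(0);

    if ($bold->length === 0 || $img === null) {
        continue; // skips non-product cells such as the keyword block
    }

    $products[] = [
        'image'       => 'http://letoyvan.com/' . ltrim($img->getAttribute('src'), '/'),
        'title'       => trim($bold->item(0)->textContent),
        'description' => trim($textCell->textContent),
    ];
}

print_r($products);
```

Building one array entry per product avoids the three parallel arrays in the question drifting out of step when a page contains an extra image or a text block with no product.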

Upvotes: 2
