FlyingCat

Reputation: 14250

Extract image elements from HTML

I am trying to extract the img tags from a block of HTML.

I have

    $parser = new DOMDocument;

    $parser->loadHTML($this->html);
    foreach ($parser->getElementsByTagName('img') as $imgNode) {
        echo $parser->saveHTML($imgNode);
    }

$this->html contains a large amount of HTML and JavaScript.

For example:

<div id='someid'>
<button id='bt' onclick='clickme()'>click me</button>
<img src='test.jpg'/>
.....
.....
more...

</div>

<div>
.....
.....
more...

I get a warning saying:

DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

I am not sure how to fix this, and I don't know whether there is a better way to extract all the images from such a large block of HTML.

Any ideas? Thanks a lot!

Upvotes: 2

Views: 257

Answers (1)

thordarson

Reputation: 6241

I am in no way an expert on these matters (yet), but I hope this helps in some way.

According to this answer by troelskn, you can make the DOM parser more tolerant of badly formed HTML by calling libxml_use_internal_errors(true) before loading the HTML. That might help you get rid of that warning.
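To illustrate, here is a minimal sketch of that approach. The sample markup in $html is made up for this example; the bare "&" in the URL is the kind of input that commonly triggers the htmlParseEntityRef warning, though the exact messages depend on the libxml2 version installed. With internal errors enabled, the parser stores its complaints instead of raising PHP warnings, and you can read them back with libxml_get_errors() if you care about them.

```php
<?php
// Hypothetical sample input; the unescaped "&" in the query string is
// a typical cause of "htmlParseEntityRef: expecting ';'".
$html = "<div id='someid'><img src='test.jpg?a=1&b=2'/></div>";

// Collect libxml errors internally instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

// Optionally inspect what the parser complained about.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}

// Empty the error buffer so it does not grow across repeated parses.
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);
```

Restoring the previous setting is optional, but it keeps the change from leaking into other code that relies on the default behaviour.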

Finding all images in a document can be done with DOMXPath. Its constructor takes a DOMDocument and lets you run XPath queries against that document.

// Suppress parse warnings; this must be called before loadHTML().
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($your_html);

$xpath = new DOMXPath($document);

// Find all img tags.
$img_nodes = $xpath->query('//img');

DOMXPath::query returns a DOMNodeList which can be looped through using DOMNodeList::item, which returns a DOMNode.

for ($i = 0; $i < $img_nodes->length; $i++)
{
    $node = $img_nodes->item($i);
    // Manipulate the node.
}
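As a rough end-to-end sketch of the steps above, the following collects the src attribute of every image. The markup in $html is a made-up stand-in for your own HTML (the second image, other.png, is invented for the example), and $sources is just an illustrative variable name. Since DOMNodeList is traversable, foreach works as an alternative to the indexed loop.

```php
<?php
// Hypothetical sample markup standing in for the real HTML.
$html = "<div id='someid'>"
      . "<button id='bt' onclick='clickme()'>click me</button>"
      . "<img src='test.jpg'/><img src='other.png' alt='second'/>"
      . "</div>";

// Keep parse problems out of the PHP warning output.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

$sources = [];
foreach ($xpath->query('//img') as $node) {
    // Every match of //img is a DOMElement, so getAttribute() is available.
    $sources[] = $node->getAttribute('src');
}

print_r($sources);
```

If you need the full tag rather than just the URL, $document->saveHTML($node) inside the loop returns the markup for that element, as in the question's own code.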

Disclaimer: The code I posted is untested and was put together using the manual.

Upvotes: 2
