Tyssen

Reputation: 1711

Extract img src from a text element in an XML feed

I have an XML feed that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
  <recent-post>
    <time>April 04, 2021, 04:20:47 pm</time>
    <id>1909114</id>
    <subject>Title</subject>
    <body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>]]></body>
  </recent-post>
</smf:xml-feed>

I want to extract the image src from the body and then save the feed to a new XML file that includes an element for the image.

So far, I have

$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->loadXML($xml);

$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( 'smf:xml-feed/recent-post/body' );

foreach( $nodes as $node )
{
    $html = new DOMDocument();
    $html->loadHTML( $node->nodeValue );
    $src = $html->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}

But when I try to print out $nodes, I get nothing. What am I missing?

Upvotes: 0

Views: 383

Answers (2)

ThW

Reputation: 19492

This looks like a Simple Machines feed. However, the namespaces are missing, and the "body" element should contain a CDATA section with an HTML fragment as text. I would expect it to look like this:

<smf:xml-feed 
  xmlns:smf="http://www.simplemachines.org/" 
  xmlns="http://www.simplemachines.org/xml/recent" 
  xml:lang="en-US">
    <recent-post>
    <time>April 04, 2021, 04:20:47 pm</time>
    <id>1909114</id>
    <subject>Title</subject>
    <body><![CDATA[
    <a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>
    ]]>
    </body>
  </recent-post>
</smf:xml-feed>

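The XML defines two namespaces. To use them in XPath expressions, you have to register prefixes for them. This includes the default namespace: in XPath 1.0, an unprefixed name such as recent-post matches only elements that are in no namespace. I suggest iterating the recent-post elements, then fetching the text content of specific child nodes using expressions with string casts.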

The body element contains the HTML fragment as text, so you need to load it into a separate document. Then you can run XPath against this document to fetch the src of the img:

$feedDocument = new DOMDocument();
$feedDocument->preserveWhiteSpace = false;
$feedDocument->loadXML($xmlString);
$feedXpath = new DOMXPath($feedDocument);

// register namespaces
$feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$feedXpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');

// iterate the posts
foreach($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
    // demo: fetch post subject as string
    var_dump($feedXpath->evaluate('string(recent:subject)', $post));
    
    // create a document for the HTML fragment
    $html = new DOMDocument();
    $html->loadHTML(
        // load the text content of the body element
        $feedXpath->evaluate('string(recent:body)', $post),
        // just a fragment, no need for html document elements or DTD
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    // XPath instance for the HTML document
    $htmlXpath = new DOMXPath($html);
    // fetch first src attribute of an img
    $src = $htmlXpath->evaluate('string(//img/@src)');
    var_dump($src);
}

Output:

string(5) "Title"
string(9) "image.png"
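
The question also asks for the result to be saved to a new XML file with an element for the image. A minimal sketch of that step follows, reusing the approach above. The element name image, the output file name, and the shortened inline sample feed are assumptions made for illustration; adjust them to the real feed.

```php
<?php
// Shortened sample feed, inlined so the sketch runs without network access.
$xmlString = <<<'XML'
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
  <recent-post>
    <id>1909114</id>
    <subject>Title</subject>
    <body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum</a>]]></body>
  </recent-post>
</smf:xml-feed>
XML;

$recentNs = 'http://www.simplemachines.org/xml/recent';

$feedDocument = new DOMDocument();
$feedDocument->preserveWhiteSpace = false;
$feedDocument->loadXML($xmlString);
$feedXpath = new DOMXPath($feedDocument);
$feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$feedXpath->registerNamespace('recent', $recentNs);

// collect HTML parser complaints instead of emitting warnings
libxml_use_internal_errors(true);
foreach ($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
    $body = $feedXpath->evaluate('string(recent:body)', $post);
    if (trim($body) === '') {
        continue; // nothing to parse for this post
    }
    $html = new DOMDocument();
    $html->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $src = (new DOMXPath($html))->evaluate('string(//img/@src)');
    if ($src !== '') {
        // assumption: the new element is called "image" and lives in the feed's default namespace
        $image = $feedDocument->createElementNS($recentNs, 'image');
        $image->appendChild($feedDocument->createTextNode($src));
        $post->appendChild($image);
    }
}
libxml_clear_errors();

// assumption: write the result to a temp file; use any writable path instead
$feedDocument->formatOutput = true;
$outputFile = sys_get_temp_dir() . '/feed-with-images.xml';
$feedDocument->save($outputFile);
```

Creating the element with createElementNS() keeps it in the same namespace as its siblings, so the same recent: prefix finds it when the saved file is queried later.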

Upvotes: 1

Nigel Ren

Reputation: 57121

There are several problems with your code, some of which I have to make assumptions about...

In

$dom->loadXML($xml);

this call expects the actual XML source as a string, not a URL. You would need to use load() instead.
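
To make the difference concrete, here is a small sketch that passes the same location string to both methods. The temp file and its contents are made up for the demonstration and stand in for the feed URL.

```php
<?php
// Hypothetical sample data: a tiny XML document written to a temp file
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, '<feed><item>one</item></feed>');

// keep parse errors out of the output
libxml_use_internal_errors(true);

// loadXML() parses its argument as markup; a path or URL is not valid XML
$fromString = new DOMDocument();
$parsedPathAsXml = $fromString->loadXML($path);

// load() treats its argument as a location and reads the document from there
$fromFile = new DOMDocument();
$loadedFromPath = $fromFile->load($path);

libxml_clear_errors();
var_dump($parsedPathAsXml, $loadedFromPath, $fromFile->documentElement->nodeName);
unlink($path);
```

Here loadXML() returns false because the path string is not markup, while load() returns true and yields a document whose root element is feed.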

I have to assume that the smf namespace is defined somewhere in the document. For testing purposes, I have altered the sample XML to...

<smf:xml-feed xml:lang="en-US" xmlns:smf="http://a.com">

I've also altered the query to

//smf:xml-feed/recent-post/body

to test this code.

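Finally, I'm not sure why you create another document inside the loop. You should be able to process this directly from the node in the loop, so I use $node as the base for the getElementsByTagName() call. This assumes the img is a real child element of body; markup stored as text would still need to be parsed separately...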

$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->load($xml);

$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( '//smf:xml-feed/recent-post/body' );

foreach( $nodes as $node )
{
    $src = $node->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}
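
If the feed wraps the body markup in CDATA, as the sample in the question does, item(0) returns null and the getAttribute() call fails. A defensive sketch that handles both shapes is shown here. The helper name extractImgSrc and the three-entry sample document are hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper: returns the first img src found in $node, or null.
function extractImgSrc(DOMElement $node): ?string
{
    // Case 1: the img exists as a real descendant element of the node
    $img = $node->getElementsByTagName('img')->item(0);
    if ($img instanceof DOMElement) {
        return $img->getAttribute('src');
    }

    // Case 2: the markup is stored as text (for example inside CDATA)
    $text = trim($node->textContent);
    if ($text === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true);
    $html = new DOMDocument();
    $html->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $html->getElementsByTagName('img')->item(0);
    return $img instanceof DOMElement ? $img->getAttribute('src') : null;
}

// Assumed sample input without namespaces, to keep the sketch short
$dom = new DOMDocument();
$dom->loadXML(
    '<feed>'
    . '<body><a href="#"><img src="element.png"/>text</a></body>'
    . '<body><![CDATA[<a href="#"><img src="cdata.png">text</a>]]></body>'
    . '<body>no image here</body>'
    . '</feed>'
);

$sources = [];
foreach ($dom->getElementsByTagName('body') as $node) {
    $sources[] = extractImgSrc($node);
}
var_dump($sources);
```

Returning null for a body without an image lets the caller decide whether to skip the post or use a placeholder.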

Upvotes: 1
