Reputation: 1711
I have an XML feed that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
<recent-post>
<time>April 04, 2021, 04:20:47 pm</time>
<id>1909114</id>
<subject>Title</subject>
<body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>]]></body>
</recent-post>
</smf:xml-feed>
I want to extract the image src from the body and then save it to a new XML file that includes an element for image.
So far, I have
$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->loadXML($xml);
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( 'smf:xml-feed/recent-post/body' );
foreach( $nodes as $node )
{
    $html = new DOMDocument();
    $html->loadHTML( $node->nodeValue );
    $src = $html->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}
But when I try to print out $nodes, I get nothing. What am I missing?
Upvotes: 0
Views: 383
Reputation: 19492
This looks like a Simple Machines feed. However, the namespaces are missing, and the content of the "body" element should be a CDATA section with an HTML fragment as text. I would expect it to look like this:
<smf:xml-feed
xmlns:smf="http://www.simplemachines.org/"
xmlns="http://www.simplemachines.org/xml/recent"
xml:lang="en-US">
<recent-post>
<time>April 04, 2021, 04:20:47 pm</time>
<id>1909114</id>
<subject>Title</subject>
<body><![CDATA[
<a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>
]]>
</body>
</recent-post>
</smf:xml-feed>
The XML defines two namespaces. To use them in XPath expressions you have to register prefixes for them. I suggest iterating over the recent-post elements, then fetching the text content of specific child nodes using expressions with string casts.
The body element contains the HTML fragment as text, so you need to load it into a separate document. Then you can use XPath on this document to fetch the src of the img:
$feedDocument = new DOMDocument();
$feedDocument->preserveWhiteSpace = false;
$feedDocument->loadXML($xmlString);
$feedXpath = new DOMXPath($feedDocument);
// register namespaces
$feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$feedXpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');
// iterate the posts
foreach ($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
    // demo: fetch post subject as string
    var_dump($feedXpath->evaluate('string(recent:subject)', $post));
    // create a document for the HTML fragment
    $html = new DOMDocument();
    $html->loadHTML(
        // load the text content of the body element
        $feedXpath->evaluate('string(recent:body)', $post),
        // just a fragment, no need for html document elements or DTD
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    // XPath instance for the HTML document
    $htmlXpath = new DOMXPath($html);
    // fetch first src attribute of an img
    $src = $htmlXpath->evaluate('string(//img/@src)');
    var_dump($src);
}
Output:
string(5) "Title"
string(9) "image.png"
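The question also asks for the src to end up in a new XML file with an element for the image. A minimal sketch of one way to do that, assuming the element is called image, lives in the same namespace as its siblings, and is appended to each recent-post. The shortened sample feed and the output file name feed-with-images.xml are hypothetical:

```php
<?php
// Hypothetical sample feed, shortened from the one in the question.
$xmlString = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
  <recent-post>
    <subject>Title</subject>
    <body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum</a>]]></body>
  </recent-post>
</smf:xml-feed>
XML;

const NS_RECENT = 'http://www.simplemachines.org/xml/recent';

// collect parser messages from sloppy HTML instead of emitting warnings
libxml_use_internal_errors(true);

$feedDocument = new DOMDocument();
$feedDocument->preserveWhiteSpace = false;
$feedDocument->formatOutput = true;
$feedDocument->loadXML($xmlString);

$feedXpath = new DOMXPath($feedDocument);
$feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
$feedXpath->registerNamespace('recent', NS_RECENT);

foreach ($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
    // parse the HTML fragment stored as text in the body element
    $html = new DOMDocument();
    $html->loadHTML(
        $feedXpath->evaluate('string(recent:body)', $post),
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    $src = (new DOMXPath($html))->evaluate('string(//img/@src)');
    if ($src === '') {
        continue; // no image in this post
    }
    // assumption: the new element is named "image" and shares the feed namespace
    $image = $feedDocument->createElementNS(NS_RECENT, 'image');
    $image->appendChild($feedDocument->createTextNode($src));
    $post->appendChild($image);
}

// write the extended feed to a new file (hypothetical name) and print it
$target = sys_get_temp_dir() . '/feed-with-images.xml';
$feedDocument->save($target);
$result = $feedDocument->saveXML();
echo $result;
```

Creating the text node separately, rather than passing the value to createElementNS(), keeps characters such as & in the URL properly escaped.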
Upvotes: 1
Reputation: 57121
There are several problems with your code, some of which I have to make assumptions about...
In
$dom->loadXML($xml);
loadXML() expects the actual XML source as a string, not a URL. To read from a URL or a file you need to use load() instead.
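The difference can be checked with a small sketch. It uses a hypothetical sample document written to a temporary file as a stand-in for the remote feed (loading from an actual URL additionally depends on the PHP configuration allowing it):

```php
<?php
// Hypothetical sample document, written to a temp file to stand in for the feed URL.
$source = '<?xml version="1.0"?><feed><item>one</item></feed>';
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, $source);

// collect parse errors instead of emitting warnings
libxml_use_internal_errors(true);

// loadXML() parses XML markup held in a string
$fromString = new DOMDocument();
$okString = $fromString->loadXML($source);

// load() reads the file at the given location and parses its content
$fromFile = new DOMDocument();
$okFile = $fromFile->load($path);

// passing the location to loadXML() fails: the path itself is not XML
$wrong = new DOMDocument();
$okWrong = $wrong->loadXML($path);

var_dump($okString, $okFile, $okWrong);
unlink($path);
```

Because both methods return false on failure, checking the return value is a quick way to notice this kind of mix-up before running any XPath queries.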
I have to assume that the smf namespace is defined somewhere in the document. For testing purposes I have altered the sample XML to...
<smf:xml-feed xml:lang="en-US" xmlns:smf="http://a.com">
I've also altered the query to
//smf:xml-feed/recent-post/body
to test this code.
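Finally, I'm not sure why you create another document inside the loop. If the body contains the img as real markup, you should be able to process it directly from the node in the loop, so I use $node as the base for the getElementsByTagName() call. If the body is wrapped in a CDATA section, its content is plain text and the separate HTML document is still needed.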
$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->load($xml);
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( '//smf:xml-feed/recent-post/body' );
foreach( $nodes as $node )
{
    $src = $node->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}
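Whether getElementsByTagName() on $node finds anything depends on how the feed stores the body. A small check, using two hypothetical one-post documents that differ only in whether the body holds real markup or a CDATA section:

```php
<?php
// Hypothetical minimal documents; only the body content differs.
$asMarkup = '<post><body><a href="#"><img src="image.png"/>text</a></body></post>';
$asCdata  = '<post><body><![CDATA[<a href="#"><img src="image.png">text</a>]]></body></post>';

// Count img elements below the body element of the given XML string.
function countImages(string $xml): int
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $body = $dom->getElementsByTagName('body')->item(0);
    return $body->getElementsByTagName('img')->length;
}

$markupCount = countImages($asMarkup); // img is an element node in the XML tree
$cdataCount = countImages($asCdata);   // img is only text inside the CDATA section

var_dump($markupCount, $cdataCount);
```

In the CDATA case item(0) would be null, so calling getAttribute() on it raises an error. Checking the length first, or parsing the text with loadHTML() as in the question, avoids that.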
Upvotes: 1