Reputation: 2663
I have an XML file that I am pulling from the web and parsing. One of the items in the XML is a 'content' value that has HTML. I am using XML::Simple::XMLin to parse the file like so:
$xml = eval { $data->XMLin($xmldata, forcearray => 1, suppressempty => '') };
When I dump the hash with Data::Dumper, I see that XML::Simple is parsing the HTML into the hash tree:
'content' => { 'div' => [ { 'xmlns' => 'http://www.w3.org/1999/xhtml', 'p' => [ { 'a' => [ { 'href' => 'http://miamiherald.typepad.com/.a/6a00d83451b26169e20133ec6f4491970b-pi', 'style' => 'FLOAT: left', 'img' => [ etc.....
This is not what I want. I just want to grab the raw HTML inside the 'content' element as a string. How do I do this?
Upvotes: 3
Views: 1214
Reputation: 132719
My general rule is that when XML::Simple starts to fail, it's time to move on to another XML processing module. XML::Simple is really meant for situations that you don't need to think about. Once you have a weird case that you have to think about, you're going to have to do some extra work that I usually find quite kludgey to integrate with XML::Simple.
Upvotes: 3
Reputation: 21
If the HTML is included directly in the XML (rather than being escaped or inside a CDATA section), then there is no way for XML::Simple to know where to stop parsing.
However, you can reconstitute just the HTML by passing that section of the data structure to XML::Simple's XMLout() function.
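A minimal sketch of that idea follows. The `<entry>` wrapper and the sample data are assumptions made for illustration, and the `div` wrapper is taken from the dump in the question. Note that XML::Simple does not preserve element order or mixed content, so the regenerated markup may differ from the original.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Hypothetical sample feed; the <entry> wrapper is an assumption.
my $xmldata = <<'XML';
<entry>
  <content>
    <div xmlns="http://www.w3.org/1999/xhtml">
      <p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
    </div>
  </content>
</entry>
XML

# KeyAttr => [] disables array folding, so the structure round-trips predictably.
my $xml = XMLin($xmldata, ForceArray => 1, KeyAttr => []);

# With ForceArray, every element is an array ref; take the first <div>
# under the first <content> and serialize it back to markup.
my $html = XMLout($xml->{content}[0]{div}[0], RootName => 'div', KeyAttr => []);

print $html;
```

Passing the `div` hash with `RootName => 'div'` keeps any attributes on the `content` element itself out of the output.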
Upvotes: 2
Reputation: 118118
#!/usr/bin/perl
use strict; use warnings;
use XML::LibXML::Reader;
my $reader = XML::LibXML::Reader->new(IO => \*DATA)
    or die "Cannot read XML\n";
if ( $reader->nextElement('content') ) {
    print $reader->readInnerXml;
}
__DATA__
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img
src="tada"/></a></p>
</div>
</content>
Output:
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>
Upvotes: 3
Reputation: 129363
If the HTML is not inside a CDATA section or otherwise encoded, you can use a slight hack.
Before processing with XML::Simple, find the contents of the <my_html> tag, which are presumably the suspect HTML, and pass them through an HTML entity encoder ("<" => "&lt;", etc.) such as HTML::Entities. Then insert the encoded content in place of the original content of the <my_html> tag.
This is VERY hacky, VERY easy to do incorrectly unless you know 100% what you're doing with regular expressions, and should not be done.
Having said that, it WILL solve your problem.
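A rough sketch of the hack, assuming the element is named `content` as in the question, that it is never nested inside another `content` element, and that the sample feed below (made up for illustration) resembles the real one:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple qw(XMLin);

# Hypothetical sample feed; the <entry> and <title> elements are assumptions.
my $xmldata = <<'XML';
<entry>
  <title>Example</title>
  <content><div xmlns="http://www.w3.org/1999/xhtml"><p>Hello <b>world</b></p></div></content>
</entry>
XML

# Escape only the XML-significant characters between the content tags.
# The non-greedy match breaks if <content> elements can be nested.
(my $escaped = $xmldata) =~ s{(<content\b[^>]*>)(.*?)(</content>)}
                             { $1 . encode_entities($2, '<>&') . $3 }gse;

# The parser now sees plain text inside <content> and decodes the entities,
# so the hash holds the original markup as a single string.
my $xml     = XMLin($escaped);
my $content = $xml->{content};

print $content, "\n";
```

Restricting `encode_entities` to `<`, `>` and `&` avoids producing named entities such as `&eacute;`, which an XML parser would reject without a DTD.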
Upvotes: 0