Reputation: 2663

How can Perl's XML::Simple ignore HTML embedded in XML?

I have an XML file that I am pulling from the web and parsing. One of the items in the XML is a 'content' value that has HTML. I am using XML::Simple::XMLin to parse the file like so:

$xml= eval { $data->XMLin($xmldata, forcearray => 1, suppressempty=> +'') };

When I dump the hash with Data::Dumper, I see that XML::Simple is parsing the HTML into the hash tree:

'content' => {
      'div' => [
                 {
                   'xmlns' => 'http://www.w3.org/1999/xhtml',
                   'p' => [
                       {
                         'a' => [
                             {
                                'href' => 'http://miamiherald.typepad.com/.a/6a00d83451b26169e20133ec6f4491970b-pi',
                               'style' => 'FLOAT: left',
                               'img' => [
                                   etc.....

This is not what I want. I just want to grab the raw content inside this element. How do I do this?

Upvotes: 3

Answers (4)

brian d foy

Reputation: 132920

My general rule is that when XML::Simple starts to fail, it's time to move on to another XML processing module. XML::Simple is really supposed to be for situations that you don't need to think about. Once you have a weird case that you have to think about, you're going to have to do some extra work that I usually find quite kludgey to integrate with XML::Simple.

Upvotes: 3

marnanel

Reputation: 21

If the HTML is included directly in the XML (rather than being escaped or inside a CDATA section), then XML::Simple has no way to know where to stop parsing.

However, you can reconstitute just the HTML by passing that section of the data structure to XML::Simple's XMLout() function.
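A minimal sketch of that idea, assuming the same forcearray setting as in the question. The sample document and the example.com URL are made up for illustration. XMLout() and its RootName option are part of XML::Simple, but treat the exact hash path as an assumption about your data:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Hypothetical sample input, cut down from the feed in the question.
my $xmldata = <<'XML';
<entry>
  <content>
    <div xmlns="http://www.w3.org/1999/xhtml"><p><a href="http://example.com/">link</a></p></div>
  </content>
</entry>
XML

my $xml = XMLin($xmldata, ForceArray => 1);

# With ForceArray => 1 every child element is an array ref,
# so the first <div> under the first <content> is at [0]{div}[0].
my $div = $xml->{content}[0]{div}[0];

# Serialize that subtree back to markup, naming the root element 'div'.
my $html = XMLout($div, RootName => 'div');

print $html;
```

Note that the round trip is lossy: XMLout() re-indents the markup, attribute order is not preserved, and mixed content (text interleaved with child elements) does not survive XML::Simple's data structure intact. If you need the HTML byte for byte, this is probably not the right tool.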

Upvotes: 2

Sinan Ünür

Reputation: 118166

#!/usr/bin/perl

use strict; use warnings;

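use XML::LibXML::Reader;

# Read the XML from the __DATA__ section below.
my $reader = XML::LibXML::Reader->new(IO => \*DATA)
    or die "Cannot read XML\n";

# Skip ahead to the first <content> element and print its
# inner markup verbatim, without building a data structure.
if ( $reader->nextElement('content') ) {
    print $reader->readInnerXml;
}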

__DATA__
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img
src="tada"/></a></p>
</div>
</content>

Output:

<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>

Upvotes: 3

DVK

Reputation: 129529

If the HTML is not inside a CDATA section or otherwise encoded, what you can do is a slight hack.

Before processing with XML::Simple, find the contents of the <my_html> tag, which are presumably the suspect HTML, and pass them through an HTML entity encoder such as HTML::Entities ("<" => "&lt;" etc.). Then insert the encoded content in place of the original content of the <my_html> tag.

This is VERY hacky, VERY easy to do incorrectly unless you know 100% what you're doing with regular expressions, and should not be done.

Having said that, it WILL solve your problem.
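A rough sketch of the hack, with all the caveats above. Here <content> stands in for <my_html>, the sample document is made up, and the regex assumes the element is never nested, never self-closing, and has no look-alike tag names:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple qw(XMLin);

# Hypothetical sample input; <content> stands in for the <my_html> tag.
my $xmldata = <<'XML';
<entry>
  <title>Example</title>
  <content><div xmlns="http://www.w3.org/1999/xhtml"><p>Hello <b>world</b></p></div></content>
</entry>
XML

# Escape the markup between <content> and </content> so the XML parser
# sees plain text there. Assumes <content> elements are never nested.
$xmldata =~ s{(<content\b[^>]*>)(.*?)(</content>)}
             {$1 . encode_entities($2, '<>&"') . $3}gse;

my $xml = XMLin($xmldata, ForceArray => 1);

# The parser decodes the entities again, so this is the original markup.
my $html = $xml->{content}[0];

print "$html\n";
```

Passing '<>&"' as the second argument to encode_entities() limits the escaping to the characters that matter to the XML parser, so other characters are left alone.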

Upvotes: 0
