Reputation: 1666
I have an HTML content snippet that contains custom XML tags, which may have attributes, CDATA, or text nodes.
The snippet is not well-formed XML, so I don't think I can use XML parser modules.
Here is a sample HTML content snippet:
<p>Hello world, mixed html and xml content</p>
<a href="http://google.com/">google</a>
<fw:blog id="title" content="hellow world" size="30" width="200px" />
<b>First content section</b>
<fw:content id="middle" width="400px" height="300px">Here is the first content section</fw:content>
<b>Second content section</b>
<fw:content id="left-part" width="400px" height="300px"><![[CDATA[ Here is the first content section]]></fw:content>
<b>Attributes may contains single or double quotes, can we skip double quotes in attributes</b>
<fw:blog id="title" content="what's your name, I may"" be cool" size="30" width="200px" />
<fw:lang id="home" />
Assuming I have the namespace fw, I need to find and replace all fw XML tags with the program output for each tag.
Upvotes: 1
Views: 1054
Reputation: 54381
I made a VERY PRAGMATIC solution to this. It's far from perfect, it uses a lot of things that I would not want to use in production code, and it probably breaks on some of the things your real data has. It does work for the example, though.
Before looking at the code, let's notice a few things that make the XML hard to parse:
- Attribute values contain unescaped double quotes, as in content="what's your name, I may"" be cool".
- The CDATA opening is wrong. You are using <![[CDATA[. There is one [ too many. It's supposed to be <![CDATA[.
I fixed these issues by simply repairing them with a regex. As I said, it is very pragmatic. I do not claim that this is a very good solution.
So here's the code:
use strict;
use warnings;
use XML::Simple;

# single-quoted heredoc: nothing in the sample should be interpolated
my $html = <<'HTML';
<p>Hello world, mixed html and xml content</p>
<a href="http://google.com/">google</a>
<fw:blog id="title" content="hellow world" size="30" width="200px" />
<b>First content section</b>
<fw:content id="middle" width="400px" height="300px">Here is the first content section</fw:content>
<b>Second content section</b>
<fw:content id="left-part" width="400px" height="300px"><![[CDATA[ Here is the first content section]]></fw:content>
<b>Attributes may contains single or double quotes, can we skip double quotes in attributes</b>
<fw:blog id="title" content="what's your name, I may"" be cool" size="30" width="200px" />
<fw:lang id="home" />
HTML

# dispatch table: tag name (without the fw: prefix) => handler
my %dispatch = (
    content => sub {
        my ($attr) = @_;
        return qq{<div width="$attr->{width}" id="$attr->{id}">Content: $attr->{content}</div>};
    },
    blog => sub {
        my ($attr) = @_;
        return qq{<p width="$attr->{width}" id="$attr->{id}">Blog: $attr->{content}</p>};
    },
    lang => sub {
        my ($attr) = @_;
        return "<p>FooLanguage</p>";
    },
);

# pragmatic repairs based on the example given:
# CDATA only has two brackets, not three, and the closing one is right
# (/g so that every CDATA section is repaired, not just the first)
$html =~ s/<!\[\[CDATA\[/<![CDATA[/g;

# replace tags that do not have a closing tag
$html =~ s{(<fw:[^>]+/>)}{parse($1)}ge;

# replace tags with a closing tag (see http://regex101.com/r/bB0kB5)
$html =~ s{
    (                           # group to $1
        <
        (                       # group to $2 and \2
            fw:                 # start with namespace-prefix
            [a-zA-Z]+           # find tagname
        )                       # end of $2
        [^>]*                   # match everything until the next > (or nothing)
        >                       # end of tag
        (?:
            [^<]+               # all the stuff before the closing tag
            |                   # or
            <!\[CDATA\[.+?\]\]> # a CDATA section
        )
        </ \2 >                 # the closing tag is the same as the opening (\2)
    )
}
{
    parse($1)                   # dispatch
}gex;                           # x allows whitespace and comments in the pattern

print $html;

sub parse {
    my ($string) = @_;

    # pragmatic repairs based on the example given:
    # there can be no unescaped quotes within quotes,
    # but there are no empty attributes either
    $string =~ s/""/{double-double-quote}/g;

    # read with XML::Simple and fetch tagname as well as attributes
    my ( $name, $attr ) = each %{ XMLin( $string, KeepRoot => 1 ) };

    # get rid of the namespace
    $name =~ s/^[^:]+://;

    # restore quotes (braces escaped so they are matched literally)
    s/\{double-double-quote\}/""/ for values %$attr;

    # dispatch
    return $dispatch{$name}->($attr);
}
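The two string repairs used in the code can be tried on their own. The following is only a minimal sketch: the sample strings are shortened, and a toy regex stands in for XML::Simple so that it runs with core Perl. The names $safe, $fixed and %attr are made up for illustration.

```perl
use strict;
use warnings;

# Repair 1: every wrong CDATA opener, hence the /g flag.
my $html = q{<fw:content><![[CDATA[ a ]]></fw:content><fw:content><![[CDATA[ b ]]></fw:content>};
( my $fixed = $html ) =~ s/<!\[\[CDATA\[/<![CDATA[/g;

# Repair 2: hide the stray "" behind a placeholder before parsing ...
my $tag = q{<fw:blog id="title" content="I may"" be cool" size="30" />};
( my $safe = $tag ) =~ s/""/{double-double-quote}/g;

# Toy attribute reader (an assumption, NOT what XML::Simple does internally).
my %attr = $safe =~ /(\w+)="([^"]*)"/g;

# ... and put the original characters back into every value afterwards.
# values() returns aliases, so the hash itself is modified.
s/\{double-double-quote\}/""/ for values %attr;

print "$fixed\n";
print "$attr{content}\n";    # prints: I may"" be cool
```

The placeholder only helps with the doubled quote from the sample. Any other quote inside an attribute value would still cut the value short.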
How does this work?
- There is a dispatch table with one handler per tag in the fw: namespace. I included content, blog and lang. I have no idea what they are supposed to do, so I made that up.
- The first substitution fixes the CDATA problem globally in the HTML/XML string.
- Both tag substitutions use the /e flag, which executes Perl code in the substitution part of the s///.
- The first of them handles self-closing tags like <foo />.
- The second handles tags like <foo>...</foo> and also handles the CDATA in the content. There is no support for CDATA in attributes! The regex uses the /x flag to allow for comments and indentation. For an explanation of the regex, see http://regex101.com/r/bB0kB5.
- The parse() sub takes the complete matched tag and does stuff to it:
  - It hides the "" inside attribute values behind a placeholder. Any other quotes within quotes, like <foo attr="this is "quoted" stuff">, will not work. You will have to find a way of dealing with these.
  - It reads the tag with XML::Simple. The KeepRoot option puts the tag name as the key, so we get { foo => { attr1 => 'bar', attr2 => 'baz' }}. I'm using the each built-in to split this up in key and value directly.
  - It calls the handler for the tag name ($name) through the dispatch table. The syntax to invoke a coderef with params is $coderef->($arg), but we are using a hash value. We pass the hashref that XML::Simple created from the attributes (and content, but it ends up like an attribute named content).
I'd like to stress again that this will probably not even work on your real data, but it might give some ideas as to how to solve it pragmatically.
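The mechanics described above (s///e, each on a single-key hash, and calling a coderef stored in a hash) can be seen in isolation in the sketch below. It is not the answer's code: toy_parse is a hypothetical stand-in that only mimics the shape XMLin returns with KeepRoot => 1, the handlers are invented, and the fallback for unknown tags is an assumption of this sketch.

```perl
use strict;
use warnings;

# Invented handlers, keyed by tag name without the fw: prefix.
my %dispatch = (
    lang => sub { my ($attr) = @_; return "<p>lang:$attr->{id}</p>" },
    blog => sub { my ($attr) = @_; return "<p>blog:$attr->{id}</p>" },
);

# Hypothetical stand-in for XMLin($tag, KeepRoot => 1): returns
# { 'fw:name' => { attribute => value, ... } }.
sub toy_parse {
    my ($tag)  = @_;
    my ($name) = $tag =~ /^<([\w:]+)/;
    my %attr   = $tag =~ /(\w+)="([^"]*)"/g;
    return { $name => \%attr };
}

sub render {
    my ($tag) = @_;
    my $data = toy_parse($tag);

    # The hash has exactly one key, so a single call to each() yields
    # the tag name and the attribute hashref.
    my ( $name, $attr ) = each %$data;
    $name =~ s/^[^:]+://;    # drop the namespace prefix

    # Assumption of this sketch: unknown tags are left untouched.
    my $handler = $dispatch{$name} or return $tag;

    # Invoke the coderef stored in the hash.
    return $handler->($attr);
}

my $html = q{<b>x</b> <fw:lang id="home" /> <fw:blog id="title" />};

# /e evaluates render($1) as Perl code for every match found by /g.
( my $out = $html ) =~ s{(<fw:[^>]+/>)}{render($1)}ge;

print "$out\n";    # prints: <b>x</b> <p>lang:home</p> <p>blog:title</p>
```

Returning the tag unchanged for an unknown name avoids the "Can't use an undefined value as a subroutine reference" error that a direct $dispatch{$name}->($attr) call raises under strict when the key is missing.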
Upvotes: 2