Perl parse xml tags manually using regular expression

Question

I have an HTML content snippet that contains custom XML tags with attributes or CDATA, and it may also contain text nodes.

The snippet is not well-formed XML, so I don't think I can use the XML parser modules.

Here is sample html content snippet:

Hello world, mixed html and xml content
google

First content section
Here is the first content section
Second content section

Attributes may contains single or double quotes, can we skip double quotes in attributes

Assuming the namespace is fw, I need to find every fw XML tag and replace it with the program output for that tag.

simbabque · Accepted Answer

I made a VERY PRAGMATIC solution to this. It's far from perfect, it uses a lot of things that I would not want to use in production code, and it probably breaks on some of the things your real data has. It does work for the example, though.

Before looking at the code, let's notice a few things that make the XML hard to parse:

Your CDATA opening is wrong. You are using <![[CDATA[, which has one [ too many. It is supposed to be <![CDATA[.

The double quotes within the attribute break XML parsers.
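
As a minimal sketch of the first repair, the substitution below turns the malformed opener into the standard one. The fragment and its id attribute are made up for illustration; only the fw:content tag name and the extra bracket come from the example.

```perl
use strict;
use warnings;

# Made-up fragment with the malformed opener "<![[CDATA[" (hypothetical data).
my $fragment = '<fw:content id="1"><![[CDATA[Some <b>raw</b> text]]></fw:content>';

# Drop the extra "[" so the section opens with the standard "<![CDATA[".
# The closing "]]>" is already correct and is left alone.
( my $repaired = $fragment ) =~ s/<!\[\[CDATA\[/<![CDATA[/g;

print "$repaired\n";
```

The /g flag repairs every occurrence in the string, not just the first one.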


I fixed these issues by simply repairing them with a regex. As I said, it is very pragmatic. I do not claim that this is a very good solution.
So here's the code:
use strict; use warnings;
use XML::Simple;

my $html = <<HTML;
Hello world, mixed html and xml content
google

First content section
Here is the first content section
Second content section

Attributes may contains single or double quotes, can we skip double quotes in attributes


HTML

# dispatch table
my %dispatch = (
  content => sub {
    my ($attr) = @_;
    return qq{Content: $attr->{content}};
  },
  blog => sub {
    my ($attr) = @_;
    return qq{Blog: $attr->{content}};
  },
  lang => sub {
    my ($attr) = @_;
    return "FooLanguage";
  }
);

# pragmatic repairs based on the example given:
# CDATA only has two brackets, not three, and the closing one is right
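$html =~ s/<!\[\[CDATA\[/<![CDATA[/g;
# replace self-closing tags, which have no separate closing tag
$html =~ s{(<fw:[a-zA-Z]+[^>]+/>)}{parse($1)}ge;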
# replace tags with a closing tag (see http://regex101.com/r/bB0kB5)
$html =~ s{
  (                # group to $1
    <
      (            # group to $2 and \2
        fw:        # start with namespace-prefix
        [a-zA-Z]+  # find tagname
      )            # end of $2
      [^>]*        # match everything until the next > (or nothing)
    >              # end of tag
    (?:
      [^<]+                 # all the stuff before the closing tag
      |                       # or
      <!\[CDATA\[.*?\]\]>     # a CDATA section
    )
    </\2>          # the closing tag is the same as the opening (\2)
  )
}
{
  parse($1)        # dispatch
}gex; # x adds extended readability (i.e. comments and whitespace)


print $html;

sub parse {
  my ($string) = @_;

  # pragmatic repairs based on the example given:
  # there can be no unescaped quotes within quotes,
  # but there are no empty attributes either
  $string =~ s/""/{double-double-quote}/g;                

  # read with XML::Simple and fetch tagname as well as attributes
  my ( $name, $attr ) = each %{ XMLin($string, KeepRoot => 1 ) };
  
  # get rid of the namespace
  $name =~ s/^[^:]+://;
  
  # restore quotes
  s/\{double-double-quote\}/""/g for values %$attr;
  
  # dispatch
  return $dispatch{$name}->($attr);
}

How does this work?

I'm assuming all the processing instructions are within tags that have the fw: namespace.
There are three types of instruction in the example: content, blog and lang. I have no idea what they are supposed to do, so I made that up.
I created a dispatch table. That's a hash with the instructions as keys and coderefs as values. A very good resource on this is the book Higher Order Perl by Mark Jason Dominus.
I fixed the CDATA problem globally in the HTML/XML string.
There are two regexes that take care of substituting the instructions with the actual content. They are using the /e flag, which executes Perl code in the substitution part of the s///.

The first one finds all tags that do not have a closing tag, i.e. self-closing tags of the form <fw:name ... />.
The second one is more complicated. It deals with paired tags of the form <fw:name ...>...</fw:name> and also handles the CDATA in the content. There is no support for CDATA in attributes! The regex uses the /x flag to allow for comments and indentation. For an explanation of the regex, see http://regex101.com/r/bB0kB5.
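
To illustrate how the dispatch table and the /e flag work together, here is a minimal sketch for self-closing tags only. It does not use XML::Simple. The attributes_of and expand helpers, the id attribute and the sample string are assumptions made up for this sketch; only the fw: prefix and the lang and blog instruction names come from the example.

```perl
use strict;
use warnings;

# Dispatch table: instruction name => code reference.
# The handlers and their output are placeholders.
my %dispatch = (
    lang => sub { return 'FooLanguage' },
    blog => sub { my ($attr) = @_; return "Blog: $attr->{id}" },
);

# Hypothetical helper: pull simple name="value" pairs out of a tag.
# It does not cope with quotes inside attribute values.
sub attributes_of {
    my ($tag) = @_;
    my %attr = $tag =~ /([\w:-]+)\s*=\s*"([^"]*)"/g;
    return \%attr;
}

# Hypothetical helper: look up the handler and call it with the attributes.
sub expand {
    my ( $name, $tag ) = @_;
    my $handler = $dispatch{$name};

    # Leave tags with an unknown instruction untouched.
    return $handler ? $handler->( attributes_of($tag) ) : $tag;
}

my $html = 'Language: <fw:lang />, post: <fw:blog id="42" />';

# /e evaluates the replacement as Perl code; /g repeats it for every match.
$html =~ s{(<fw:([a-zA-Z]+)\b[^>]*/>)}{ expand($2, $1) }ge;

print "$html\n";    # Language: FooLanguage, post: Blog: 42
```

Returning the original tag for an unknown instruction is a design choice of this sketch. It avoids calling an undefined coderef when the data contains an instruction that has no handler yet.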


My parse() sub takes the complete matched tag and does stuff to it:

It replaces the double-double-quotes with a placeholder. If there is a real instance of quoted stuff inside an attribute, it will break! An attribute value that contains an actual quoted string, rather than an empty "" pair, will not work. You will have to find a way of dealing with these.
It uses XML::Simple to break down the tag into a hashref with attributes. The KeepRoot option puts the tag name as the key, so we get { foo => { attr1 => 'bar', attr2 => 'baz' }}. I'm using the each built-in to split this up into key and value directly.
It turns the placeholders back into double-double-quotes.
Dispatch the instruction (which is in $name) through the dispatch table. The syntax to invoke a coderef with params is $coderef->($arg), but we are using a hash value. We pass the hashref that XML::Simple created from the attributes (and content, but it ends up like an attribute named content).
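
The last three steps can be sketched without XML::Simple by starting from a hand-built hashref that has the same shape as the KeepRoot result. The title attribute, its value and the handler output are made up for this sketch.

```perl
use strict;
use warnings;

# Stand-in for what XMLin($string, KeepRoot => 1) returns for one tag:
# a hash with a single key (the tag name) whose value holds the attributes.
# The attribute name "title" and both values are hypothetical.
my $parsed = {
    'fw:blog' => { title => 'say {double-double-quote}', content => 'Hello' },
};

# A single-key hash gives up its only pair on the first call to each().
my ( $name, $attr ) = each %$parsed;

# Strip the namespace prefix: "fw:blog" becomes "blog".
$name =~ s/^[^:]+://;

# Put the doubled quotes back. values() aliases the hash values in a
# for loop, so the substitution edits them in place.
s/\{double-double-quote\}/""/g for values %$attr;

my %dispatch = (
    blog => sub {
        my ($args) = @_;
        return "Blog: $args->{content} ($args->{title})";
    },
);

# Invoke the coderef stored under the instruction name.
my $output = $dispatch{$name}->($attr);

print "$output\n";    # Blog: Hello (say "")
```

The braces in the placeholder pattern are escaped because newer Perl versions warn about or reject an unescaped literal { in some positions of a regex.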



I'd like to stress again that this will probably not even work on your real data, but it might give you some ideas as to how to solve it pragmatically.
