Find and replace xml tags in html string in Perl with regex

Question

I need to find and replace XML tags inside an HTML string. The string is not complete XML, so I cannot use an XML parser on it. Instead, I need to find the XML tags manually and replace them with content.

Example of html string containing the xml tags:

some texthello p
 other text or html open tags like 
So I need to find the XML "vars" tags, each with a variable number of optional attributes, and replace them with content.

daliaessam · Accepted Answer

Looking at some Perl parsers for XML and HTML, such as Mojo::DOM (as pointed out in Miller's answer above) and XML::TreePP, I found that they use regexes to parse the entire content. I tried their regexes and got good results, though they may need some optimization.

Here is what I did:

my $text =<<'XHTML';
some text
hello p

other text or html open tags like


XHTML

while ( $text =~ m{(<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>/])*)/?>)}sxgi ) {
    my $match = $1;
    my $args = $2;
    #print "[[$match]] \n{{$args}}\n\n";

    #parse name=value attributes, values may be double or single quoted or unquoted
    while ( $args =~ m/([^<>=\s\/]+|\/)(?:\s*=\s*(?:"([^"]*?)"|'([^']*?)'|([^>\s\/]*)))?\s*/sxgi ) {
        my $name = $1;
        #any better solution with regex above to just get $2
        my $value = $2? $2: ($3? $3 : $4);
        print "$name=$value\n";
    }
    print "\n";
}


And here is the output, as expected:

type=text
name=fname
single=single quoted
unqouted=noquotes
hastags= color='red' Class::SubClass->color

type=text
name=lname
single1=single quoted
unqouted1=noquotes
hastags1= bgcolor='red' Class::SubClass->bgcolor

name=mname


Of course, the variable $match in the code holds the entire match, so I can replace it with my content.
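
A minimal sketch of that replacement step, under stated assumptions: the %values lookup, the sample input, and the helper names parse_attrs and replace_vars are all hypothetical and not part of the original code. The tag pattern is a simplified variant of the one in the loop above, and the `//` operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical replacement values, keyed by the tag's name attribute.
my %values = ( fname => 'Jane', lname => 'Doe' );

# Made-up sample input; the unclosed <b> shows the text need not be valid XML.
my $text = q{Hello <vars type="text" name="fname">, <b>dear <vars name='lname' />!};

# Parse name=value pairs from the inside of one tag into a hash reference.
sub parse_attrs {
    my ($args) = @_;
    my %attr;
    while ( $args =~ m{([^<>=\s/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^>\s/]*)))?\s*}g ) {
        # At most one of the three value groups is defined per match.
        $attr{$1} = $2 // $3 // $4 // '';
    }
    return \%attr;
}

# Replace every <vars ...> tag; tags with an unknown name are left untouched.
sub replace_vars {
    my ($html) = @_;
    $html =~ s{(<vars\s+((?:"[^"]*"|'[^']*'|[^"'<>])*?)\s*/?>)}{
        my ( $tag, $args ) = ( $1, $2 );    # copy before the nested match runs
        my $attr = parse_attrs($args);
        my $name = $attr->{name} // '';
        exists $values{$name} ? $values{$name} : $tag;
    }gie;
    return $html;
}

print replace_vars($text), "\n";
```

With the sample input above this prints `Hello Jane, <b>dear Doe!`. Using `s///ge` does the find and the replace in one pass, so there is no need to track match offsets by hand.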

The second regex, which matches the attributes, needs optimization. I am not satisfied with this line:

my $value = $2? $2: ($3? $3 : $4);


Can the regex be modified to capture the attribute value in $2 alone?
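
One possible approach, offered as a sketch rather than a tested drop-in: Perl 5.10 and later support the branch-reset group `(?|...)`, which numbers the capture groups of each alternative from the same starting point, so whichever alternative matches leaves its value in $2. The sample attribute string and the parse_attrs helper below are illustrative and not from the original post.

```perl
use strict;
use warnings;

# Attribute pattern with a branch-reset group: the capture in every
# alternative is numbered 2, so only one value variable is needed.
my $ATTR_RE = qr{
  ([^<>=\s/]+|/)       # key (group 1)
  (?:
    \s*=\s*
    (?|
      "([^"]*?)"       # double-quoted value (group 2)
    |
      '([^']*?)'       # single-quoted value (group 2)
    |
      ([^>\s/]*)       # unquoted value (group 2)
    )
  )?
  \s*
}x;

# Hypothetical helper: return the attributes as a list of [name, value] pairs.
sub parse_attrs {
    my ($args) = @_;
    my @pairs;
    while ( $args =~ /$ATTR_RE/g ) {
        # $2 is undef when the attribute has no "=value" part.
        push @pairs, [ $1, $2 // '' ];
    }
    return @pairs;
}

my $args = q{type="text" single='single quoted' unquoted=noquotes checked};
print "$_->[0]=$_->[1]\n" for parse_attrs($args);
```

Compared with `$2? $2: ($3? $3 : $4)`, this also keeps values that Perl treats as false, such as `0`, because no truthiness test is involved.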

The regex as used in Mojo::DOM is:

my $ATTR_RE = qr/
  ([^<>=\s\/]+|\/)   # Key
  (?:
    \s*=\s*
    (?:
      "([^"]*?)"     # Quotation marks
    |
      '([^']*?)'     # Apostrophes
    |
      ([^>\s\/]*)    # Unquoted
    )
  )?
  \s*
/x;
my $END_RE   = qr!^\s*/\s*(.+)!;
my $TOKEN_RE = qr/
  ([^<]+)?                                          # Text
  (?:
    <\?(.*?)\?>                                     # Processing Instruction
  |
