dellair

Reputation: 437

Perl regular expression to match embedded tag once

I have some text in which I would like to match only the lines where a tag appears exactly once. The text is below (the random characters can be anything except tags):

<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>

What I want is to match a tag2 that contains tag3 exactly once.

For example:

<tag2><tag3>something</tag3></tag2> is matched
<tag2><tag3>something</tag3><tag3>something</tag3></tag2> isn't matched

For the text above, the expected output is lines 2 and 5.

The regexes I tried (neither worked):

<tag2><tag3>(.*)?</tag3></tag2>
<tag2><tag3>(.*){1}</tag3></tag2>

Upvotes: 0

Views: 74

Answers (3)

Sobrique

Reputation: 53488

I would urge you not to use regular expressions to manipulate XML, ever. Regular expressions cannot handle a contextual language like XML, so you end up with brittle code that a perfectly valid change to the XML formatting (such as different whitespace) can break.

So instead:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

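# Parse the XML that follows __DATA__.
my $twig = XML::Twig->parse( \*DATA );

# Visit every tag2 and print those with exactly one tag3 child.
foreach my $element ( $twig->get_xpath('//tag2') ) {
   if ( scalar $element->children('tag3') == 1 ) {
      $element->print;
      print "\n";
   }
}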

__DATA__
<root>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
</root>

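This handles XML formatted the way yours is (once it is wrapped in a single root element), and it works just as well if everything is on a single line. It also works if the XML looks like this: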

<root>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
</root>

Or like this:

<root
><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1></root>

All of these are semantically identical to your XML.
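
To make the brittleness concrete, here is a minimal sketch using only core Perl. The pattern is an illustrative one written for this example, not something from your code: it expects tag3 directly inside tag2 with no whitespace between the tags. It is tried against a compact line and against the same XML indented.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Illustrative pattern (an assumption for this sketch): exactly one tag3
# directly inside tag2, with no whitespace allowed between the tags.
my $pattern = qr{<tag2><tag3>[^<]+</tag3></tag2>};

# The same XML content, formatted two different ways.
my $compact  = '<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>';
my $indented = "<tag1>\n  <tag2>\n    <tag3>Some randome chars</tag3>\n  </tag2>\n</tag1>";

my %matched = (
    compact  => ( $compact  =~ $pattern ? 1 : 0 ),
    indented => ( $indented =~ $pattern ? 1 : 0 ),
);

for my $name ( sort keys %matched ) {
    printf "%-8s %s\n", $name, $matched{$name} ? 'matched' : 'not matched';
}
```

The compact form matches and the indented form does not, even though an XML parser treats both as the same structure. You could loosen the pattern with \s* between tags, but attributes, comments or CDATA would break it again.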

Upvotes: 4

choroba

Reputation: 241958

Use an XML-aware tool. I tried the following in xsh, a wrapper around XML::LibXML:

ls //tag2[1=count(tag3)]

After adding line numbers to the tag2 elements, I got:

<tag2>2<tag3>Some randome chars</tag3></tag2>
<tag2>5<tag3>Some randome chars</tag3></tag2>
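
If you would rather stay in plain Perl, the same XPath idea can be sketched with XML::LibXML directly. This assumes XML::LibXML is installed; the <root> wrapper and the shortened sample data are assumptions made to keep the example self-contained.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::LibXML;

# Shortened sample data, wrapped in a root element so it is well-formed.
my $xml = <<'XML';
<root>
<tag1><tag2><tag3>a</tag3><tag3>b</tag3></tag2></tag1>
<tag1><tag2><tag3>c</tag3></tag2></tag1>
</root>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Select every tag2 that has exactly one tag3 child and serialize it.
my @matches = map { $_->toString } $doc->findnodes('//tag2[count(tag3) = 1]');

print "$_\n" for @matches;
```

With this sample only the second tag2 is selected, because the first one has two tag3 children.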

Upvotes: 1

simbabque

Reputation: 54333

Your regexes didn't work because the . in your capture group matches any character, including < and >. The .* is greedy, so it runs as far as possible and stops only at the last </tag3>, which means lines with several tag3 elements match as well. If you want to match only content that cannot include tags, match anything except the < that opens a tag.

m{<tag2><tag3>([^<]+)</tag3></tag2>}g

Try it on regex101.com.
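
Here is a small sketch of the difference, using shortened sample strings that are assumptions made for this example. It runs a greedy pattern and the negated character class against a tag2 with two tag3 children and against a tag2 with one.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Shortened sample strings: tag2 with two tag3 children, and with one.
my $two = '<tag2><tag3>a</tag3><tag3>b</tag3></tag2>';
my $one = '<tag2><tag3>c</tag3></tag2>';

my $greedy = qr{<tag2><tag3>(.*)</tag3></tag2>};      # . also matches < and >
my $strict = qr{<tag2><tag3>([^<]+)</tag3></tag2>};   # content may not contain <

# The greedy capture swallows the inner closing and opening tags.
my ($greedy_capture) = $two =~ $greedy;

# The negated class rejects the two-child case and accepts the one-child case.
my $strict_on_two    = $two =~ $strict ? 1 : 0;
my ($strict_capture) = $one =~ $strict;

print "greedy capture on two children: $greedy_capture\n";
print "strict matches two children:    ", ( $strict_on_two ? 'yes' : 'no' ), "\n";
print "strict capture on one child:    $strict_capture\n";
```

The greedy pattern captures a</tag3><tag3>b from the two-child string, which is why it matched every line. The [^<]+ version cannot cross a tag boundary, so it matches only when there is a single tag3.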

Upvotes: 2
