Reputation: 437
I have some text in which I would like to match a tag only when it appears once. The text is below (the random chars can be anything except tags):
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
I want to match a tag3 within tag2 only when it appears exactly once.
For example:
<tag2><tag3>something</tag3></tag2> is matched
<tag2><tag3>something</tag3><tag3>something</tag3></tag2> isn't matched
Based on the text above, the expected output is lines 2 and 5.
The regex I tried (didn't work):
<tag2><tag3>(.*)?</tag3></tag2>
<tag2><tag3>(.*){1}</tag3></tag2>
Upvotes: 0
Views: 74
Reputation: 53488
I would urge you not to use regular expressions to manipulate XML - ever. Regular expressions cannot handle a contextual language like XML, and as a result you build brittle code that a perfectly valid change to the XML formatting (such as different whitespace) might break.
So instead:
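#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Parse the XML that follows __DATA__ at the end of this script.
my $twig = XML::Twig->parse( \*DATA );

# Visit every tag2 and print the ones with exactly one tag3 child.
foreach my $element ( $twig->get_xpath('//tag2') ) {
    if ( scalar( $element->children('tag3') ) == 1 ) {
        $element->print;
        print "\n";
    }
}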
__DATA__
<root>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
</root>
This will handle XML formatted as yours is, but also XML that is all on a single line. Or like this:
<root>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
</root>
Or like this:
<root
><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1></root>
All of these are semantically identical to yours.
Upvotes: 4
Reputation: 241958
Use an XML-aware tool. I tried the following in xsh, a wrapper around XML::LibXML:
ls //tag2[1=count(tag3)]
After adding line numbers to the tag2 elements, I got:
<tag2>2<tag3>Some randome chars</tag3></tag2>
<tag2>5<tag3>Some randome chars</tag3></tag2>
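The same XPath can also be run from a plain Perl script with XML::LibXML directly. The following is only a minimal sketch: the shortened sample data, the wrapping root element, and the variable names are assumptions made for illustration, not part of the xsh session above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Shortened sample, wrapped in a single root element (an assumption:
# a well-formed XML document needs exactly one root).
my $xml = <<'XML';
<root>
<tag1><tag2><tag3>a</tag3><tag3>b</tag3></tag2></tag1>
<tag1><tag2><tag3>c</tag3></tag2></tag1>
<tag1><tag2>
  <tag3>d</tag3>
</tag2></tag1>
</root>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Select every tag2 that has exactly one tag3 child element.
my @matches = $doc->findnodes('//tag2[count(tag3)=1]');

# Print each matching tag2 element as serialized XML.
print $_->toString, "\n" for @matches;
```

Because the count is taken on the parsed tree, the third record matches even though its tags are spread over several lines.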
Upvotes: 1
Reputation: 54333
Your regex didn't work because you were allowing everything (.) in your capture group. That is very greedy and will go as far as possible, only stopping at the last </tag3>. If you want to match only stuff that cannot include tags, you need to match anything but an opening tag token.
m{<tag2><tag3>([^<]+)</tag3></tag2>}g
Try it on regex101.com.
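As a rough sketch of how that pattern could be applied, the script below checks each line and reports the numbers of the lines that match. It assumes one record per line, as in the question; the helper name single_tag3_lines and the generated sample data are made up for this example. The /g flag is dropped because one match per line is enough here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Rebuild the question's eight lines: each holds this many <tag3> elements.
my $t3     = '<tag3>Some randome chars</tag3>';
my @counts = ( 4, 1, 3, 2, 1, 6, 5, 4 );
my @lines  = map { sprintf '<tag1><tag2>%s</tag2></tag1>', $t3 x $_ } @counts;

# Hypothetical helper: return the 1-based numbers of the lines in which
# <tag2> contains exactly one <tag3>.
sub single_tag3_lines {
    my @input = @_;
    my @hits;
    for my $i ( 0 .. $#input ) {
        # [^<]+ cannot cross a tag boundary, so a second <tag3>
        # between the anchoring tags prevents a match.
        push @hits, $i + 1
            if $input[$i] =~ m{<tag2><tag3>([^<]+)</tag3></tag2>};
    }
    return @hits;
}

print "line $_\n" for single_tag3_lines(@lines);
```

Note that the pattern requires the tags to be directly adjacent, so a record with whitespace or newlines between <tag2> and <tag3> would not match.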
Upvotes: 2