Reputation: 76506
I have a file containing this XML data:
<?xml version="1.0" encoding="utf-8"?>
<root>
<item>
<tag1>some text</tag1>
<tag2><![CDATA[http://url1.com]]></tag2>
<tag3 />
<tag4>not empty node</tag4>
</item>
<item>
<tag1>some other text</tag1>
<tag2><![CDATA[http://www.url.com]]></tag2>
<tag3 />
<tag4 />
</item>
</root>
(and a lot more XML inside)
I am trying to write a Bash script to remove some of the XML. Namely, I want to remove every <item> element that has an empty <tag4> child element.
Therefore I want to find <item>, then find <tag4/>, then find </item>, group this and replace it with an X char.
I haven't even got to the grouping yet; I am stuck on getting a regex to match across multiple lines.
Running on Mac OS X.
This is what I have got:
perl -pn -e "s/<item>[\s\S]*<tag4 \/>/X/g" $XML_FILENAME > new_folder/$XML_FILENAME
If I remove the [\s\S]* (which matches any character, whitespace or not), I can replace the <item> tag, but I can't get to the next tag or the next line.
(I also tried echo// and sed, getting stuck in a similar position.)
Upvotes: 1
Views: 1918
Reputation: 77105
One way with GNU awk:
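awk '
BEGIN {
    ORS = ""
    RS = "<[/]?item>"    # regex RS (a GNU awk extension): split on <item> and </item>
    f1 = "<item>"
    f2 = "</item>"
}
# Skip records that contain an empty <tag4 /> or only whitespace
!/<tag4 \/>/ && NF {
    # Re-wrap item bodies in the tags that RS consumed; print other text as-is
    print (($0 ~ /tag/) ? f1 $0 f2 : $0)
}' xmlfile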
Upvotes: 0
Reputation: 1482
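This works on the sample but needs a little work (it slurps the whole file and only recognises the exact spelling <tag4 />):
perl -0777 -pe 's/<item>(?:(?!<\/item>).)*?<tag4 \/>(?:(?!<\/item>).)*?<\/item>/X/gs' test.xml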
Upvotes: 0
Reputation: 200303
Better to use an actual XML parser for this (e.g. XML::LibXML) and select empty <tag4> nodes with an XPath expression:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;
my $xml = XML::LibXML->new->parse_file('/path/to/input.xml');
$_->unbindNode for $xml->findnodes('//item[not(tag4/text())]');
print $xml->toString;
If you want to directly save the modified XML to a file, replace the line
print $xml->toString;
with
$xml->toFile('/path/to/output.xml');
Upvotes: 3
Reputation: 89567
try this:
s/<item>(?>[^<]++|<(?!tag4))*<tag4 \/>(?>[^<]++|<(?!\/item>))*<\/item>/X/g
This pattern avoids the newline problem because it doesn't use the dot.
Explanation
Detail of (?>[^<]++|<(?!tag4))*:
(?> # open an atomic group
[^<]++ # all that is not a < one or more times (possessive)
| # OR
<(?!tag4) # a < not followed by tag4
)* # close the atomic group, repeat zero or more times
Using this trick, I can be sure that what follows is <tag4 (or the end of the string).
I use atomic groups (?>..) and possessive quantifiers ++ for better performance, but you can replace them with non-capturing groups (?:..) and greedy quantifiers +.
Notes
Alternatively, you can use a lazy quantifier, replacing [\s\S]* with [\s\S]*?
Note that with Perl you can use dotall mode instead of [\s\S] by adding the s modifier:
(?s).* # the dot matches newlines
(?-s).* # the dot doesn't match newlines (default behavior)
Upvotes: 2
Reputation: 126732
Using regular expressions to process XML is impractical. You should use a proper Perl module.
This short program uses XML::Twig to process the file whose name is passed as a command-line parameter. It sends the modified XML to STDOUT.
use utf8;
use strict;
use warnings;
use XML::Twig;
my $twig= XML::Twig->new(pretty_print => 'indented');
$twig->parsefile($ARGV[0]);
# Delete every item whose tag4 is missing or holds only whitespace
for my $item ($twig->findnodes('/root/item')) {
    $item->delete unless $item->findvalue('tag4') =~ /\S/;
}
$twig->print;
output
<?xml version="1.0" encoding="utf-8"?>
<root>
<item>
<tag1>some text</tag1>
<tag2><![CDATA[http://url1.com]]></tag2>
<tag3/>
<tag4>not empty node</tag4>
</item>
</root>
Upvotes: 2