Blundell

Reputation: 76506

Bash script with Perl multi-line regex (OSX)

I have a file containing this XML data:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <item>
    <tag1>some text</tag1>
    <tag2><![CDATA[http://url1.com]]></tag2>
    <tag3 />
    <tag4>not empty node</tag4>
  </item>
  <item>
    <tag1>some other text</tag1>
    <tag2><![CDATA[http://www.url.com]]></tag2>
    <tag3 />
    <tag4 />
  </item>
</root>

(and a lot more XML inside)

I am trying to write a Bash script to remove some of the XML. Namely, I want to remove every <item> element that has an empty <tag4> child element.

So I want to match from <item> through <tag4 /> to </item>, capture that as a group, and replace it with an X character.

I haven't got as far as the grouping yet; I am stuck on getting a regex to match across multiple lines.

I am running this on Mac OS X.

This is what I have got:

 perl -pn -e "s/<item>[\s\S]*<tag4 \/>/X/g" $XML_FILENAME > new_folder/$XML_FILENAME

If I remove the [\s\S]* (which should match any character, whitespace or not), I can replace the <item> tag, but I can't get to the next tag or the next line.

(I also tried echo// and sed, and got stuck in a similar position.)

Upvotes: 1

Views: 1918

Answers (5)

jaypal singh

Reputation: 77105

One way with GNU awk:

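awk '
BEGIN {
    ORS = ""               # print records back without adding newlines
    RS  = "<[/]?item>"     # split on <item> and </item> (a regex RS needs gawk)
    f1  = "<item>"
    f2  = "</item>"
}
# Keep non-blank records that do not contain an empty <tag4 />,
# and re-wrap item bodies in the tags that RS consumed.
!/<tag4 \/>/ && NF {
    print (($0 ~ /tag/) ? f1 $0 f2 : $0)
}' xmlfile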

Upvotes: 0

michael501

Reputation: 1482

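This works on the sample but is still fragile. It slurps the whole file (-0777), matches the empty <tag4 /> rather than a closing </tag4>, and stops each match at the first </item>:

 perl -0777 -pe 's/<item>(?:(?!<\/item>).)*?<tag4\s*\/>(?:(?!<\/item>).)*<\/item>/X/gs' test.xml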

Upvotes: 0

Ansgar Wiechers

Reputation: 200303

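It is better to use an actual XML parser for this (e.g. XML::LibXML) and select the <item> elements whose <tag4> has no text content with an XPath expression. Note that the expression below also matches an <item> that has no <tag4> child at all: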

#!/usr/bin/env perl

use strict;
use warnings;
use XML::LibXML;

my $xml = XML::LibXML->new->parse_file('/path/to/input.xml');

$_->unbindNode for $xml->findnodes('//item[not(tag4/text())]');

print $xml->toString;

If you want to directly save the modified XML to a file, replace the line

print $xml->toString;

with

$xml->toFile('/path/to/output.xml');

Upvotes: 3

Casimir et Hippolyte

Reputation: 89567

Try this:

s/<item>(?>[^<]++|<(?!tag4))*<tag4 \/>(?>[^<]++|<(?!\/item>))*<\/item>/X/g

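This pattern avoids the dot-and-newline problem because it doesn't use the dot. The input still has to reach the regex as a single string (for example with perl -0777), because -p on its own processes one line at a time.

Explanation

Detail of (?>[^<]++|<(?!tag4))*: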

(?>                # open an atomic group
      [^<]++       # all that is not a < one or more times (possessive)
    |              # OR
      <(?!tag4)    # a < not followed by tag4
)*                 # close the atomic group, repeat zero or more times

With this construct, whatever follows the group is guaranteed to be <tag4 (or the end of the string).

I use atomic groups (?>..) and possessive quantifiers ++ for better performance, but you can replace them with ordinary non-capturing groups (?:..) and greedy quantifiers +.

Notes

Alternatively, you can just make the quantifier lazy, replacing [\s\S]* with [\s\S]*?

Note that with perl you can use the dotall mode instead of [\s\S] adding the s modifier:

 (?s).*          # the dot matches newlines
 (?-s).*         # the dot doesn't match newlines (default behavior)

Upvotes: 2

Borodin

Reputation: 126732

Using regular expressions to process XML is impractical. You should use a proper Perl module.

This short program uses XML::Twig to process the file whose name is passed as a command-line parameter. It sends the modified XML to STDOUT.

use utf8;
use strict;
use warnings;

use XML::Twig;

my $twig= XML::Twig->new(pretty_print => 'indented');
$twig->parsefile($ARGV[0]);

# Delete every <item> whose <tag4> holds no non-whitespace text
for my $item ($twig->findnodes('/root/item')) {
  $item->delete unless $item->findvalue('tag4') =~ /\S/;
}

$twig->print;

Output

<?xml version="1.0" encoding="utf-8"?>
<root>
  <item>
    <tag1>some text</tag1>
    <tag2><![CDATA[http://url1.com]]></tag2>
    <tag3/>
    <tag4>not empty node</tag4>
  </item>
</root>

Upvotes: 2
