Reputation: 441

Remove text between substrings (on the same line or across lines) only if it contains a pattern

There is some data (XML) in a file, and I need to remove text (not the whole line, so sed's d command does not suit) from Substring1 up to Substring2 (both included), but only if that text contains a pattern. My problem is that the formatting varies: Substring1 and Substring2 can be on the same line or on different lines, and there can be several Substring1/Substring2 pairs on the same line.

Example (line 1: two Substring1/Substring2 pairs, the first of which contains PATTERN; line 2: one pair with PATTERN; line 3: one pair without PATTERN; lines 4 and 5: one pair with PATTERN; lines 6 and 7: one pair without PATTERN):

Substring1 = <?xml

Substring2 = </update>

Pattern = PATTERN

tmp.log
<?xml version="1.0" encoding="UTF-8" PATTERN-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update><?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line2 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line4 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

Expected output:
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

I've tried (without full success) different combinations like the following:

sed -i "s#<?xml.*PATTERN.*</update>##g" tmp.log

sed -i "#<?xml#{p; :a; N; #</update>#!ba; s#.*\n##}; p" tmp.log

perl -pi -e 's/<?xml.*PATTERN.*update>//' tmp.log

As far as I can see, these remove whole lines and skip the case when substrings are located on different lines. I also do not perform real checking for PATTERN here. Any help appreciated.

Upvotes: 0

Answers (3)

zdim

Reputation: 66883

If there is much more of this to do, please use one of the good XML modules; both XML::LibXML and XML::Twig are excellent. That said, here is a direct regex approach.

use warnings;
use strict;

# Sample text for testing
my $text = q(start <?xml with PATTERN yes </update> and <?xml good </update> end); 

my $beg  = qr(<\?xml);
my $end  = qr(</update>);
my $patt = qr(PATTERN);

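# (?:(?!$end).)*? stops the search for $patt at the first $end, so a block
# without PATTERN is not swallowed together with a later block that has it.
$text =~ s|$beg(?:(?!$end).)*?$patt.*?$end||gs;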

print "$text\n";

The .*? is non-greedy. The newlines are taken care of by the modifier /s which makes . match them. Since the text in the question is unclear to me I've used the $text above as input:

start <?xml with PATTERN yes </update> and <?xml good </update> end

With this input in $text, the above code prints

start  and <?xml good </update> end

Upvotes: 1

ssr1012

Reputation: 2589

Please try this one:

use strict;
use warnings;

my $newDATA = "";
while(<DATA>)
{
    my $each_line = $_;  my $dump = $each_line;
        my ($pre, $match, $post) = ("", "", "");  # initialise all three, not just $pre
        while($each_line=~/<\?xml((?:(?!<\?xml|\n).)*)<\/update>/sg)
        {
            $pre = $pre.$`; $match=$&; $post = $'; my $dupmatch = $match;
            if($dupmatch=~m/PATTERN/i)
            {  $match = "";  }
            $pre = $pre.$match; $each_line = $post;
        }
        if(length $pre) {  $each_line = $pre.$post;  }
        $newDATA .= $each_line;
}
$newDATA =~ s/\n+/\n/g;  # collapse runs of newlines left behind by removed blocks
print $newDATA;

INPUT:

__DATA__
<?xml version="1.0" encoding="UTF-8" PATTERN-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update><?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line2 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line4 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

OUTPUT:

<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

Your XML tagging is very inconsistent. Could you please check it, and try the Perl code above?

Upvotes: 0

jijinp

Reputation: 2662

With gawk:

awk -v RS='<\\?xml' -v ORS='' 'NR!=1 && !/PATTERN/ {print "<?xml" $0}' tmp.log

Upvotes: 2
