Reputation: 441
I have some data (XML) in a file, and I need to remove the text from Substring1 up to Substring2 (inclusive), but only if that text contains a pattern. I cannot delete whole lines, so the d command of sed does not fit. The problem is that the formatting varies: Substring1 and Substring2 can be on the same line or on different lines, and there can be several Substring1/Substring2 pairs on the same line.
Example (1st line - 2 pairs of Substrings1/2 and first one contains PATTERN, 2nd line - 1 pair with PATTERN, 3rd line - 1 pair without PATTERN, 4th and 5th lines - 1 pair with PATTERN, 6th and 7th lines - 1 pair without PATTERN):
Substring1 = <?xml
Substring2 = </update>
Pattern = PATTERN
tmp.log
<?xml version="1.0" encoding="UTF-8" PATTERN-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update><?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line2 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line4 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>
Expected output:
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>
I've tried (without full success) different combinations like the following:
sed -i "s#<?xml.*PATTERN.*</update>##g" tmp.log
sed -i "#<?xml#{p; :a; N; #</update>#!ba; s#.*\n##}; p" tmp.log
perl -pi -e 's/<?xml.*PATTERN.*update>//' tmp.log
As far as I can see, these remove whole lines and miss the case where the substrings are on different lines. They also do not really check for PATTERN. Any help is appreciated.
Upvotes: 0
Views: 103
Reputation: 66883
If there is actually any more of this, please use the good modules for XML. Both XML::LibXML and XML::Twig are excellent. That said, here is direct parsing.
use warnings;
use strict;
# Sample text for testing
my $text = q(start <?xml with PATTERN yes </update> and <?xml good </update> end);
my $beg = qr(<\?xml);
my $end = qr(</update>);
my $patt = qr(PATTERN);
$text =~ s|$beg.*?$patt.*?$end||gs;
print "$text\n";
The .*? is non-greedy. The newlines are taken care of by the modifier /s, which makes . match them. Since the text in the question is unclear to me, I've used the $text above as input:
start <?xml with PATTERN yes </update> and <?xml good </update> end
With this input in $text, the above code prints
start and <?xml good </update> end
Upvotes: 1
Reputation: 2589
Please try this one:
use strict;
use warnings;

my $newDATA = "";
while (<DATA>)
{
    my $each_line = $_;
    # Initialise all three pieces explicitly.
    my ($pre, $match, $post) = ("", "", "");
    # Match one <?xml ... </update> block at a time, within the current line only.
    while ($each_line =~ /<\?xml((?:(?!<\?xml|\n).)*)<\/update>/sg)
    {
        $pre = $pre.$`; $match = $&; $post = $';
        # Drop the block if it contains PATTERN (case-insensitive).
        if ($match =~ m/PATTERN/i) { $match = ""; }
        $pre = $pre.$match; $each_line = $post;
    }
    if (length $pre) { $each_line = $pre.$post; }
    $newDATA .= $each_line;
}
# Collapse the blank lines left behind by removed blocks.
$newDATA =~ s/\n{2,}/\n/g;
print $newDATA;
INPUT:
__DATA__
<?xml version="1.0" encoding="UTF-8" PATTERN-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update><?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line2 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line4 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>
OUTPUT:
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <update>2016-03-24</update><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <update>2016-03-24</update>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>
Your XML tagging is very inconsistent. Could you please check it, and then try the Perl code above?
Upvotes: 0
Reputation: 2662
With gawk:
awk -v RS='<\\?xml' 'NR!=1 && !(/PATTERN/){printf "<?xml%s", $0}' tmp.log
Upvotes: 2