Reputation: 1396
Using sed/awk, I need to remove all lines in a file from the first occurrence of pattern1 up to (but not including) the last occurrence of pattern2.
Consider the following text:
<entity name="good">
</entity>
<entity name="bad">
stuff to delete
</entity>
<entity name="bad">
stuff to remove
</entity>
<entity name="bad2">
</entity>
<entity name="deleteMe2">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>
I would like the following outcome:
<entity name="good">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>
I know how to do a range in sed, but I can't figure out how to match the last occurrence of 'bad2' and exclude it from the delete. The command below will of course not work: it stops at the first 'bad2', so it does not remove 'deleteMe2' or the earlier occurrence of 'bad2'.
sed -i '/<entity name="bad"/,/<entity name="bad2"/d' file.xml
There can be hundreds of 'bad'/'deleteMe2'/'bad2' lines in the file I am dealing with, so a simple line count won't work. I am fine if this is multiple commands (it does not have to be just a single one), but the more efficient the better because the file being modified can be quite large. As well, the -i is because I want to do an in place delete of the lines between.
NOTE: I am more familiar with SED than I am with AWK, but I am open to all the help I can get:)
Upvotes: 0
Views: 171
Reputation: 58483
This might work for you (GNU sed):
sed '/bad/,$!b;/bad2/h;//!H;$!d;g;/bad2/!d' file
Lines that are not between the first match of bad and the end of the file are printed as normal. Otherwise, lines are appended to the hold space, which is overwritten whenever a line matches bad2. All lines but the last are deleted. On the last line, the pattern space is replaced with the contents of the hold space, and the result is deleted unless it contains bad2.
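To make the steps above easier to follow, here is the same script spread over several lines with comments. This is only a sketch: the function name strip_range is made up for illustration, and it assumes GNU sed (other seds may treat the empty regex or indented comments differently). Note that /bad/ matches any line containing "bad", and everything after the most recent bad2 is buffered in memory until the end of the file.

```shell
# strip_range is a hypothetical wrapper name, not part of sed.
# Usage: strip_range file.xml > cleaned.xml
strip_range() {
  sed '
    # Lines before the first match of "bad": print them unchanged.
    /bad/,$!b
    # A line matching "bad2" replaces whatever is in the hold space.
    /bad2/h
    # Any other line (the empty regex reuses /bad2/) is appended to it.
    //!H
    # Print nothing until the last line is reached.
    $!d
    # Last line: fetch everything collected since the last "bad2".
    g
    # If "bad2" never appeared, drop the collected text.
    /bad2/!d
  ' "$1"
}
```

Add -i to the sed call to edit the file in place, as in the question.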
Upvotes: 0
Reputation: 204015
$ cat tst.awk
NR==FNR {
if (/"bad"/ && !begFnr) {
begFnr = FNR
}
if (/"bad2"/) {
endFnr = FNR
}
next
}
(FNR < begFnr) || (FNR >= endFnr)
$ awk -f tst.awk file file
<entity name="good">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>
Upvotes: 1
Reputation: 67507
awk to the rescue!
$ awk 'NR==FNR&&/\"bad\"/&&!s{s=NR;next}
NR==FNR&&/\"bad2\"/{e=NR;next}
NR!=FNR && (FNR<s || FNR>=e)' xml{,}
<entity name="good">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>
This can probably be simplified further. It is a two-pass script: the first pass records the line numbers, and the second pass prints only the lines outside that range.
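Since the question asks for an in-place edit, here is a rough sketch of the same two-pass idea wrapped in a shell function that writes the result back to the file. The function name delete_span and the variable names are made up for illustration. The patterns are passed with -v, so they are treated as awk dynamic regexes and backslashes in them are subject to awk escape processing. If either pattern is missing, or the last stop match comes before the first start match, the file is left as it was.

```shell
# delete_span is a hypothetical helper name.
# Usage: delete_span FILE START_REGEX STOP_REGEX
delete_span() {
  tmp=$(mktemp) || return 1
  awk -v start="$2" -v stop="$3" '
    # First pass: remember the first start line and the last stop line.
    NR == FNR {
      if (!s && $0 ~ start) s = FNR
      if ($0 ~ stop) e = FNR
      next
    }
    # Second pass: print lines outside [s, e), or everything if
    # no usable range was found.
    !s || !e || e < s || FNR < s || FNR >= e
  ' "$1" "$1" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Copy back instead of mv so the original permissions are kept.
  cat "$tmp" > "$1" && rm -f "$tmp"
}
```

For the sample above, something like delete_span file.xml '"bad"' '"bad2"' should leave only the two "good" entities and the final "bad2" entity. Including the double quotes in the pattern keeps "bad" from also matching "bad2".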
Upvotes: 0
Reputation: 53498
This looks like XML to me, so I would strongly suggest that regex isn't the tool for the job. Use a parser instead:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> new -> parsefile ( 'your_file.xml' ) ;
$_ -> delete for $twig -> findnodes ( '//entity[@name="bad"]');
$twig -> set_pretty_print('indented_a');
$twig -> print;
Or perhaps more comprehensively:
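for my $entity ( $twig -> findnodes ( '//entity') ) {
# Default to an empty string so entities without a name attribute
# do not trigger "uninitialized value" warnings.
my $name = $entity -> att('name') // '';
if ( $name eq "bad"
or $name eq "deleteMe2" ) {
$entity -> delete;
}
}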
To delete only the first instance of 'bad2' you can just call findnodes once, and delete the first 'hit'.
Upvotes: 1