Martin

Reputation: 119

bash (sed or awk preferred) to remove everything between first and last instance

I'm pretty familiar with sed but I don't know awk very well, and I'm not sure how to solve this problem. I've googled for a while but no luck so far. Here's the situation: I've got a big file with groups and sections, like so:

<A1>
  some nr of lines
</A1>
<A2>
  some nr
  of lines
</A2>
<B1>
  some
  nr of
  lines
</B1>
<B2>
  some nr of lines
</B2>
<B3>
  bla
</B3>
<C1>
  bla
</C1>
<C2>
  bla
</C2>

Now the problem is that the number of groups, the number of sections in each group, and the number of lines in each section can all change. For example, group A might go up to A25, group B might go up to B8, and so on. What I need to do is remove all sections of certain groups. In the example above I'd like to remove everything in <B*>, leaving me with the following:

<A1>
  some nr of lines
</A1>
<A2>
  some nr
  of lines
</A2>
<C1>
  bla
</C1>
<C2>
  bla
</C2>

Additionally, there would be several groups I would want to remove (although these can be in separate runs). For example, if the file goes from A1 to R123, I'd want to remove B*, F*, M*, etc.

If something similar has already been asked and answered somewhere I apologize, I did try to find a solution before posting.

Thanks!

Upvotes: 2

Views: 1191

Answers (2)

Ed Morton

Reputation: 204721

I think what you're looking for is something like this:

awk -v rmv="AC" 'BEGIN{
   gsub(/./,"|&",rmv)
   sub(/$/,")[0-9]+>$",rmv)
   start = end = rmv
   sub(/^\|/,"^<(",start)
   sub(/^\|/,"^</(",end)
}
$0 ~ start { f=1 }
!f
$0 ~ end   { f=0 }
' file

Just populate the "rmv" variable with the letters of all the groups you want removed, one character per group:

$ awk -v rmv="B" '...'
<A1>
  some nr of lines
</A1>
<A2>
  some nr
  of lines
</A2>
<C1>
  bla
</C1>
<C2>
  bla
</C2>
$ awk -v rmv="AC" '...'
<B1>
  some
  nr of
  lines
</B1>
<B2>
  some nr of lines
</B2>
<B3>
  bla
</B3>
$
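
To see what the BEGIN block in the script above builds, here is a small sketch that prints the two regexps instead of filtering a file. The wrapper name show_patterns is made up for illustration; the awk statements are the ones from the answer.

```shell
# show_patterns is a hypothetical helper name used only for this demo.
# $1 is the list of group letters, one character per group.
show_patterns() {
  awk -v rmv="$1" 'BEGIN{
     gsub(/./,"|&",rmv)           # "BFM" becomes "|B|F|M"
     sub(/$/,")[0-9]+>$",rmv)     # append the closing part of the pattern
     start = end = rmv
     sub(/^\|/,"^<(",start)       # leading "|" becomes "^<("
     sub(/^\|/,"^</(",end)        # leading "|" becomes "^</("
     print start
     print end
  }'
}

show_patterns "BFM"
```

For "BFM" this prints ^<(B|F|M)[0-9]+>$ and ^</(B|F|M)[0-9]+>$, so a line only switches the flag if it consists of nothing but an opening or closing tag of one of the listed groups.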

Upvotes: 1

anubhava

Reputation: 786339

Using sed:

sed '/<B1>/,/<\/B3>/d' infile

This finds the range of lines starting at <B1> and ending at </B3> and deletes it, so sed prints the rest of the file to stdout.

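EDIT: This will also work for your case, without having to know the first and last section numbers, because each B section is matched and deleted as its own range:

sed '/<B[0-9]*>/,/<\/B[0-9]*>/d' infile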
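
To cover several groups in one run (the question mentions B*, F*, M*), the same range idea can take a bracket expression with all the group letters. This is only a sketch: the function name remove_groups is made up, and the anchored patterns assume each tag sits alone on its line with no surrounding whitespace, as in the sample from the question.

```shell
# remove_groups is a hypothetical helper name, not part of sed.
# $1 holds the group letters to drop, e.g. "BFM"; input is read from stdin.
remove_groups() {
  # [letters] matches any one of the groups; [0-9][0-9]* requires a number.
  # ^ and $ restrict the match to lines that consist of the tag alone.
  sed "/^<[$1][0-9][0-9]*>\$/,/^<\\/[$1][0-9][0-9]*>\$/d"
}

# Made-up sample input with one section each in groups A, B, C and F.
printf '%s\n' '<A1>' '  a' '</A1>' '<B1>' '  b' '</B1>' \
              '<C1>' '  c' '</C1>' '<F1>' '  f' '</F1>' |
  remove_groups 'BF'
```

With the sample input above, only the A1 and C1 sections are printed. For a real file the call would look like remove_groups 'BFM' < infile.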

Upvotes: 6
