Vagelis Prokopiou

Reputation: 2693

Delete lines by multiple patterns in specific range of lines

I have the following (simplified) file:

<RESULTS>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,3</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,1</COLUMN>
  </ROW>
  <ROW>
    <COLUMN NAME="TITLE">title 1</COLUMN>
    <COLUMN NAME="VERSION">1,2</COLUMN>
  </ROW>
</RESULTS>

What I am trying to achieve is to delete all ROW elements that match on the title, but do not match on the latest VERSION (in this case 1,3). So, what I have in mind is something like the following with sed:

sed -i '/<ROW>/,/<\/ROW>/<COLUMN NAME=\"TITLE\">title 1.*<COLUMN NAME=\"VERSION\">^1,3<\/COLUMN>/d' file

The expected output should be the following:

<RESULTS>
<ROW>
  <COLUMN NAME="TITLE">title 1</COLUMN>
  <COLUMN NAME="VERSION">1,3</COLUMN>
</ROW>
</RESULTS>

Unfortunately, this did not work, and neither did anything else I tried. I searched extensively for similar issues, but found nothing that worked for me. Is there a way to achieve this with any Linux command-line utility (sed, awk, etc.)?

Thanks a lot in advance.

Upvotes: 0

Views: 181

Answers (2)

potong

Reputation: 58420

This might work for you (GNU sed):

sed '/<ROW>/{:a;N;/<\/ROW>/!ba;/TITLE.*title 1/!b;/VERSION.*1,3/b;d}' file

Gather up lines between <ROW> and </ROW>.

If the lines collected don't contain the correct title, bail out (the block is printed unchanged).

If the lines collected do contain the correct version, bail out as well.

Otherwise, delete the lines collected.
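Spread over several lines, the same script may be easier to follow. The sketch below assumes GNU sed and wraps the command in a shell function; the name filter_rows is made up for illustration, and the sed commands are the ones from the one-liner, only with comments and layout added.

```shell
# filter_rows is a hypothetical wrapper name; GNU sed is assumed.
filter_rows() {
  sed '
    /<ROW>/ {
      # Append lines to the pattern space until the closing tag arrives.
      :a
      N
      /<\/ROW>/!ba
      # A block with a different title is printed unchanged.
      /TITLE.*title 1/!b
      # A block with the wanted version is printed unchanged.
      /VERSION.*1,3/b
      # Same title, other version: drop the whole block.
      d
    }
  ' "$1"
}
```

Called as filter_rows file, it writes the filtered document to standard output, so the result can be inspected before switching to sed -i.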

Upvotes: 2

Beta

Reputation: 99094

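/<ROW>/,/<\/ROW>/ followed by another pattern won't work. A range address only selects lines, from a line matching /<ROW>/ to the next line matching /<\/ROW>/, and the commands after it still run on one line at a time. Since the pattern space holds a single line, a regex that tries to span the TITLE and VERSION lines can never match. (An address must also be followed by a command, not by a second regex.)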

You'll have to use one of the advanced features of sed. The simplest is probably the hold space.

This:

sed -n '/<ROW>/{h;d;};H;'

will store an entire ROW block in the hold space, and overwrite it when it encounters a new ROW block. (And print nothing.)

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;p;}'

will store the entire ROW block, then print it out when it is complete.

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;p;}'

will do the same, but will delete a block that does not contain "title 1".

This:

sed -n '/<ROW>/{h;d;};H;/<\/ROW>/{g;/title 1/!d;/1,3/p;}'

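will do the same, but print only if the block contains "1,3". (You can spell out the matching lines more explicitly; I'm trying to keep this code concise.) Note that because of -n, nothing outside the printed block appears in the output, so the <RESULTS> and </RESULTS> lines are dropped as well.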
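For comparison, the same buffer-then-decide idea can be sketched in awk instead of sed, which also lets lines outside ROW blocks pass through untouched. This is only a rough sketch: the function name filter_rows_awk and the variables want_title and keep_version are illustrative, and it assumes each ROW opens and closes on its own line, as in the sample.

```shell
# filter_rows_awk, want_title and keep_version are hypothetical names.
filter_rows_awk() {
  awk -v want_title='title 1' -v keep_version='1,3' '
    # Start buffering when a ROW opens.
    /<ROW>/ { in_row = 1; block = "" }

    # Inside a ROW, collect lines; outside, print them straight away.
    in_row  { block = block $0 "\n" }
    !in_row { print }

    # When the ROW closes, keep it unless it has the wanted title
    # but a different version.
    /<\/ROW>/ {
      in_row = 0
      has_title   = (index(block, ">" want_title "<") > 0)
      has_version = (index(block, ">" keep_version "<") > 0)
      if (!has_title || has_version) {
        printf "%s", block
      }
    }
  ' "$1"
}
```

Matching on ">title 1<" and ">1,3<" rather than the bare strings is a design choice to avoid partial hits such as "title 10" or "11,3".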

Upvotes: 2
