user3152289

Reputation: 107

Linux: Command to delete line(s) from XML file with matching string starting with the 2nd occurrence

I have an XML file that looks something like this:

<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .
</Header>

I would like to delete the opening "Header" lines (not the closing "/Header" lines), starting with the 2nd occurrence - don't ask why :-). The output should look something like this (yes, I know it is not well-formed, but I am going to perform other processing on it as well):

<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
   .
   .
   .
</Header>
<Date>2017-04-18</Date>
   .
   .
   .
</Header>
<Date>2017-04-18</Date>
   .
   .
   .
</Header>

I tried:

sed -i '2,${/<Header/d;}' file

but that deleted all the occurrences of Header. Any suggestions?

Thanks

Upvotes: 1

Views: 1051

Answers (2)

Yunnosch

Reputation: 26753

sed '/<Header/{p;:a;s/^.*$//;N;s/\n//;/<Header/!p;ba}' input.txt
  • find the first occurrence
  • print it
  • start a loop
    • forget the current line
    • get the next
    • get rid of the unwanted newline
    • print it if it is not a match
  • loop

This assumes that your header lines are always a single line. Otherwise it gets tough. In that case, think about whether this might be an XY problem (see comment by Cyrus). I also assume that removing the indentation of the date lines is not actually wanted.
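For comparison, the same "keep only the first matching line" idea can be sketched with awk instead of sed. This is a swapped-in technique, not the command above. The sample input and the names input, seen and awk_out are made up for illustration, and it relies on the same single-line-header assumption.

```shell
# Hypothetical sample input; the real file has more lines per block.
input=$(mktemp)
printf '%s\n' \
  "<Header version= '1.0'>" "   <Date>2017-04-18</Date>" "</Header>" \
  "<Header version= '1.0'>" "   <Date>2017-04-18</Date>" "</Header>" > "$input"

# seen counts the lines containing "<Header". A matching line is skipped
# (next) once the count exceeds 1. The trailing 1 prints every other line
# unchanged, so indentation is preserved.
awk_out=$(awk '/<Header/ && ++seen > 1 { next } 1' "$input")
printf '%s\n' "$awk_out"
rm -f "$input"
```

Closing lines such as </Header> never match /<Header/ because of the slash after the angle bracket, so they are always printed.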

Upvotes: 0

potong

Reputation: 58473

This might work for you (GNU sed):

sed '/^<\/Header/,${/^<Header/d}' file

From the first closing Header tag to the end of the file, delete any line beginning with an opening Header tag. The first opening Header line comes before that range starts, so it is kept.
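A quick way to check this on a throwaway copy before using -i is sketched below. The sample contents and the names sample and result are assumptions for illustration. The semicolon before the closing brace is added here only because some non-GNU sed implementations expect it.

```shell
# Build a small sample resembling the question's file (contents are illustrative).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
<Header version= '1.0' timestamp='2017-01-04T07:10:07'>
   <Date>2017-04-18</Date>
</Header>
EOF

# Address range: from the first line starting with </Header to the last line ($).
# Inside that range, delete every line starting with <Header.
result=$(sed '/^<\/Header/,${/^<Header/d;}' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

On this sample the opening Header on line 1 is kept and the two later ones are removed. Note that the patterns are anchored with ^, so indented Header lines would not match.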

Upvotes: 2
