user1625714

Reputation: 93

Delete duplicate consecutive lines without sort or unique in xml file

I have an xml file where I need to keep the order of the tags but have a tag called media that has duplicate lines in consecutive order. I would like to delete one of the duplicate media tags but want to preserve all of the parent tags - (which are also consecutive and repeat). I'm wondering if there is an awk solution to delete only if a pattern is matched. For example:

<story>
   <article>
      <media>One line</media>
      <media>One line</media>    <-- Same line as above, want to delete this
      <media>Another Line</media>
      <media>Another Line</media>  <-- Another duplicate, want to delete this
   </article>
</story>
<story>
   <article>
     ........ and so on

I want to keep the consecutive story and article tags and just delete duplicates for the media tag. I've tried a number of awk scripts but nothing seems to work without sorting the file and ruining the order of the xml. Any help much appreciated.

Upvotes: 4

Views: 2295

Answers (4)

nu11p01n73R

Reputation: 26667

An awk one-liner can do this: print each line only when it differs from the previous line, which is remembered in the variable f.

awk '!(f == $0){print} {f=$0}' input

Test

$ cat input
<story>
   <article>
      <media>One line</media>
      <media>One line</media>
      <media>Another Line</media>
      <media>Another Line</media>
this
   </article>
</story>
<story>
   <article>

$ awk '!(f == $0){print} {f=$0}' input
<story>
   <article>
      <media>One line</media>
      <media>Another Line</media>
this
   </article>
</story>
<story>
   <article>

OR

$ awk 'f!=$0&&f=$0' input

Thanks to Jidder
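One caveat worth noting about the shortened form (this is an observation about awk semantics, not part of the original answer): `f=$0` is an assignment, and the value of the assignment is the line itself, so any line that awk treats as false (an empty line, or a bare `0`) is dropped even when it is not a duplicate:

```shell
# The value of the assignment f=$0 is used as the print condition,
# so a falsy line ("0" here) is silently skipped:
printf '%s\n' 'a' '0' 'b' | awk 'f!=$0 && f=$0'
# prints:
# a
# b
```

The longer form `!(f == $0){print} {f=$0}` does not have this problem, since `print` does not depend on the truthiness of the line.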

Upvotes: 6

NeronLeVelu

Reputation: 10039

Use the behaviour of uniq, which normally expects a sorted file: it removes duplicate lines that exactly repeat the line immediately before them.

uniq YourFile
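Note that uniq collapses any run of identical consecutive lines, not only `<media>` ones, so this is only safe when no other lines in the file legitimately repeat back-to-back. A small illustration (hypothetical input, not from the question's file):

```shell
# uniq removes every consecutive duplicate, regardless of the tag,
# so the repeated <p> line is also dropped:
printf '%s\n' '<media>x</media>' '<media>x</media>' '<p>y</p>' '<p>y</p>' | uniq
# prints:
# <media>x</media>
# <p>y</p>
```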

Upvotes: 3

potong

Reputation: 58473

This might work for you (GNU sed):

sed -r 'N;/^(\s*<media>.*)\n\1$/!P;D' file

This deletes duplicate lines that begin with the <media> tag.

N.B. This deletes the first line of each duplicate pair rather than the second, but since the two lines are identical it makes no difference to the output.
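Run against a file reproducing the question's sample (a sketch, assuming GNU sed for `-r` and `\s`):

```shell
# Build a sample file matching the question's structure
printf '%s\n' \
  '<story>' \
  '   <article>' \
  '      <media>One line</media>' \
  '      <media>One line</media>' \
  '      <media>Another Line</media>' \
  '      <media>Another Line</media>' \
  '   </article>' \
  '</story>' > sample.xml

# N appends the next line; if the two-line window is NOT a duplicated
# <media> line, P prints the first line; D drops it and restarts.
sed -r 'N;/^(\s*<media>.*)\n\1$/!P;D' sample.xml
# prints:
# <story>
#    <article>
#       <media>One line</media>
#       <media>Another Line</media>
#    </article>
# </story>
```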

Upvotes: 1

John1024

Reputation: 113914

Consider the file:

$ cat file
<story>
   <article>
      <media>One Line</media>
      <media>One Line</media>
      <media>Another Line</media>
      <media>Another Line</media>
   </article>
</story>
<story>
   <article>
     ........ and so on

To remove duplicate media lines and only duplicate media lines:

$ awk '/<media>/ && $0==last{next} {last=$0} 1' file
<story>
   <article>
      <media>One Line</media>
      <media>Another Line</media>
   </article>
</story>
<story>
   <article>
     ........ and so on

How it works

  • /<media>/ && $0==last{next}

    Any line that has a <media> tag and matches the previous line is skipped: the command next tells awk to skip all remaining commands and start over on the next line.

  • last=$0

    This saves the last line, in its entirety, in the variable last.

  • 1

    This is cryptic awk notation which means print the current line. If you prefer clarity to conciseness, you may replace the 1 with {print $0}.
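Putting those pieces together, a fully spelled-out equivalent (same behaviour, just verbose; the input here is a hypothetical minimal sample) would be:

```shell
printf '%s\n' '<media>x</media>' '<media>x</media>' '<p>y</p>' '<p>y</p>' |
  awk '/<media>/ && $0 == last { next }   # skip a <media> line equal to the previous line
       { last = $0 }                      # remember the current line for the next cycle
       { print $0 }                       # explicit form of the bare 1'
# prints:
# <media>x</media>
# <p>y</p>
# <p>y</p>
```

Note that the duplicated `<p>y</p>` lines survive: only `<media>` duplicates are removed, which is exactly what the question asks for.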

Upvotes: 2
