Reputation: 93
I have an XML file where I need to keep the order of the tags, but a tag called media has duplicate lines in consecutive order. I would like to delete one of the duplicate media tags but want to preserve all of the parent tags (which are also consecutive and repeat). I'm wondering if there is an awk solution that deletes only if a pattern is matched. For example:
<story>
<article>
<media>One line</media>
<media>One line</media> <-- Same line as above, want to delete this
<media>Another Line</media>
<media>Another Line</media> <-- Another duplicate, want to delete this
</article>
</story>
<story>
<article>
........ and so on
I want to keep the consecutive story and article tags and just delete duplicates for the media tag. I've tried a number of awk scripts, but nothing seems to work without sorting the file and ruining the order of the XML. Any help much appreciated.
Upvotes: 4
Views: 2295
Reputation: 26667
An awk script like this will do it: it prints a line only when it differs from the previous line, then remembers the current line for the next comparison.
awk '!(f == $0){print} {f=$0}' input
Test
$ cat input
<story>
<article>
<media>One line</media>
<media>One line</media>
<media>Another Line</media>
<media>Another Line</media>
this
</article>
</story>
<story>
<article>
$ awk '!(f == $0){print} {f=$0}' input
<story>
<article>
<media>One line</media>
<media>Another Line</media>
this
</article>
</story>
<story>
<article>
OR
$ awk 'f!=$0&&f=$0' input
Thanks to Jidder
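For readability, the same logic can also be written out long-hand; this is just a sketch equivalent to the one-liners above:

awk '
    f != $0 { print }    # print only when the current line differs from the previous one
            { f = $0 }   # remember the current line for the next comparison
' input

Note that the shorter f!=$0&&f=$0 form relies on the assignment f=$0 also evaluating as true, so it would drop empty lines (and lines consisting of just 0); that is not an issue for the XML shown here.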
Upvotes: 6
Reputation: 10039
This uses the behaviour of uniq, which normally expects a sorted file: it removes duplicate lines that exactly repeat the immediately preceding line. Since your duplicates are already consecutive, no sorting is needed:
uniq YourFile
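A minimal usage sketch, since uniq cannot safely write back to the file it is reading (the .dedup name is just an example):

uniq YourFile > YourFile.dedup   # write the de-duplicated output to a new file
mv YourFile.dedup YourFile       # replace the original once the result looks right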
Upvotes: 3
Reputation: 58473
This might work for you (GNU sed):
sed -r 'N;/^(\s*<media>.*)\n\1$/!P;D' file
This deletes duplicate lines that begin with the <media> tag.
N.B. This deletes the first line of each duplicate pair rather than the second, but as the lines are identical it should not matter.
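If you want to edit the file in place with GNU sed, the same script can be combined with -i; a sketch, where -i.bak keeps the original as file.bak:

$ sed -r -i.bak 'N;/^(\s*<media>.*)\n\1$/!P;D' file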
Upvotes: 1
Reputation: 113914
Consider the file:
$ cat file
<story>
<article>
<media>One Line</media>
<media>One Line</media>
<media>Another Line</media>
<media>Another Line</media>
</article>
</story>
<story>
<article>
........ and so on
To remove duplicate media lines and only duplicate media lines:
$ awk '/<media>/ && $0==last{next} {last=$0} 1' file
<story>
<article>
<media>One Line</media>
<media>Another Line</media>
</article>
</story>
<story>
<article>
........ and so on
/<media>/ && $0==last{next}
Any line that has a <media> tag and matches the previous line is skipped: the command next tells awk to skip all remaining commands and start over on the next line.
last=$0
This saves the current line, in its entirety, in the variable last, so it can be compared against the next line.
1
This is cryptic awk notation which means print the current line. If you prefer clarity to conciseness, you may replace the 1 with {print $0}.
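Putting those pieces together, the one-liner can be written more verbosely; a sketch equivalent to the command above:

awk '
    /<media>/ && $0 == last { next }       # skip a <media> line identical to the previous line
                            { last = $0 }  # remember the current line
                            { print $0 }   # print every line that was not skipped
' file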
Upvotes: 2