user1889034

Reputation: 343

sed (awk?) to remove nearly duplicate lines

I have a file that alternates HTML-style comments with its real text:

<!-- Here's a first line -->
Here's a first line
<!-- Here's a second line -->
Here's a third line

If a comment is identical to the following line apart from the tags themselves, I want to delete it, but otherwise leave it:

Here's a first line
<!-- Here's a second line -->
Here's a third line

I've read the similar questions here, but have been unable to adapt their solutions to my situation.

Upvotes: 0

Views: 71

Answers (3)

potong

Reputation: 58351

This might work for you (GNU sed):

sed -r '$!N;/<!-- (.*) -->\n\1$/!P;D' file

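This reads each pair of consecutive lines into the pattern space and tests the pair against the requested condition. If the pair matches, the first line (the comment) is not printed; either way, the first line is then deleted and the second becomes the start of the next pair.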

N.B. This caters for consecutive comment lines
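
To make the consecutive-comment case concrete, here is a small sketch that wraps the same command in a shell function and feeds it made-up input in which two comments come in a row. The function name strip_dup_comments and the sample lines are only for illustration, and GNU sed is assumed, as above.

```shell
# Hypothetical wrapper around the one-liner above (GNU sed assumed).
strip_dup_comments() {
  sed -r '$!N;/<!-- (.*) -->\n\1$/!P;D' "$@"
}

# Illustrative input: the first comment is followed by another comment,
# and only the second comment duplicates the line after it.
printf '%s\n' '<!-- Foo -->' '<!-- Bar -->' 'Bar' 'Baz' | strip_dup_comments
# prints:
# <!-- Foo -->
# Bar
# Baz
```

Because D restarts the cycle with the second line of each pair still in the pattern space, every line gets compared with its successor, so the "Bar" comment is still removed even though it directly follows another comment.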

Upvotes: 1

Jeff Bowman

Reputation: 95614

sed '/^<!-- \(.*\) -->$/N;s/^<!-- \(.*\) -->\n\1$/\1/'
#
#    /^<!-- \(.*\) -->$/   match an HTML comment as its own line, in which case
#                       N; add the next line to the pattern space and keep going
# 
#                         s/^<!-- \(.*\) -->\n\1$/     detect a comment as you
#                                                 \1/  described and replace it
#                                                      appropriately

As shown:

$ sed '/^<!-- \(.*\) -->$/N;s/^<!-- \(.*\) -->\n\1$/\1/' <<EOF
> <!-- Foo -->
> Foo
> <!-- Bar -->
> Baz
> <!-- Quux -->
> Quux
> 
> Something
> Something
> Another something
> EOF

Gives:

Foo
<!-- Bar -->
Baz
Quux

Something
Something
Another something

You may need to tweak this to handle indentation, but that shouldn't be too surprising. You may also want to switch to sed -r, which requires that the parentheses NOT be escaped.
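
As a rough sketch of both tweaks, the variant below uses -r (so the parentheses are unescaped) and tolerates leading spaces or tabs on either line. It assumes GNU sed and that you want to keep the text line's own indentation; the function name strip_indented_dups and the sample input are made up for illustration.

```shell
# Hypothetical variant of the command above (GNU sed assumed).
# [[:blank:]]* allows optional leading spaces/tabs before the comment
# and before the text line; \2 re-emits the text line's indentation.
strip_indented_dups() {
  sed -r '/^[[:blank:]]*<!-- (.*) -->$/N;s/^[[:blank:]]*<!-- (.*) -->\n([[:blank:]]*)\1$/\2\1/' "$@"
}

printf '%s\n' '  <!-- Foo -->' '  Foo' '<!-- Bar -->' 'Baz' | strip_indented_dups
# prints:
#   Foo
# <!-- Bar -->
# Baz
```

Like the original, this pairs a comment only with the line immediately after it, so a comment that directly follows another comment is not checked against its own next line.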

Upvotes: 1

anubhava

Reputation: 784908

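You can use this awk:

awk '/<!--.*-->/{if (c) print h; h=$0; gsub(/ *(<!--|-->) */, ""); s=$0; c=1; next}
      c && $0!=s{$0=h ORS $0} {c=0} 1
      END{if (c) print h}' file.html

Here h holds the pending comment line, s holds its text with the tags stripped, and c flags that a comment is still waiting for the line that follows it. The flag keeps a stale comment from being re-inserted before later non-comment lines, and lets a comment that is followed by another comment (or by the end of the file) still be printed.

For the sample input in the question, this prints:

Here's a first line
<!-- Here's a second line -->
Here's a third line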

Upvotes: 1
