Looking for a one-liner to remove duplicate multiline paragraphs from a file

Question

File:
this is a paragraph
to find in another 
file

some stuff .. 

more stuff ... 

this is a paragraph
to find in another 
file

more stuff ... 

another paragraph 
to match

yet more stuff.. 

this is a paragraph
duplicate in this 
file

another paragraph 
to match 

this is a paragraph
duplicate in this 
file

yet more stuff..

this is a paragraph
to find in another
file

should return:

this is a paragraph
to find in another 
file

some stuff .. 

more stuff ... 

more stuff ... 

another paragraph 
to match

yet more stuff.. 

this is a paragraph
duplicate in this 
file

yet more stuff..

I have found pcregrep -n -M, and I know I could loop over the paragraphs and search for each one using sed and this command. However, pcregrep is not on every system, so I would like to avoid it if possible. I am looking for something elegant that uses the standard *nix tools, preferably not Perl.

* Some good posts and ideas, but although they worked on the limited case I posted, they did not work generally, so I have adjusted the example data to show whether a solution works more generally.

* Here is a sed one-liner that prints only multiline paragraphs:

sed -e '/./{H;$!d;}' -e 'x;/.*\n.*\n.*/!d' file

Etan Reisner · Accepted Answer

This mostly does what you want. The only problem (I know of offhand) is that it collapses runs of blank lines in the input into a single blank line in the output.

awk -v RS= '!x[$0]++{print; print ""}'

This uses the fact that "If RS is set to the null string, then records are separated by blank lines." and prints an extra blank line to replace the record separator that awk swallowed.
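To make the paragraph-mode behavior concrete, here is a minimal sketch on a tiny made-up input (the sample text and the variable names `input` and `result` are illustrative assumptions, not part of the question's data). It only demonstrates that each blank-line-separated block becomes one record and that a repeated block is dropped.

```shell
# Hypothetical sample: three paragraphs, the third repeating the first.
input=$(printf 'a\nb\n\nc\n\na\nb\n')

# RS= (empty) puts awk in paragraph mode: each block separated by blank
# lines is one record, so $0 holds a whole paragraph including newlines.
# seen[$0]++ evaluates to 0 the first time a paragraph appears, so
# !seen[$0]++ is true only for the first occurrence.
result=$(printf '%s\n' "$input" | awk -v RS= '!seen[$0]++{print; print ""}')

# Command substitution drops the trailing blank line that the extra print adds.
printf '%s\n' "$result"
```

On this input the second copy of the two-line paragraph should be removed while the first copy and the single-line paragraph remain.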

Edit: Incorporating @EdMorton's suggestions gets you this instead.

awk -v RS= -v ORS='\n\n' '!seen[$0]++'

With GNU awk, use awk -v RS= '!seen[$0]++{ORS=RT; print}' to keep the spacing between paragraphs consistent with the input (instead of collapsing runs of blank lines).

Edit again:

This version seems to work correctly (with GNU awk 3.1.7 and newer, I don't know about 3.1.6) with the one exception that it adds a blank line to the end of the file.

awk -v RS= '{gsub(/[[:blank:]]+$/,""); gsub(/[[:blank:]]+\n/,"\n")} !seen[$0]++{ORS=RT;print}'

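The GNU-specific RT variable is not available in other awks, so the following is a rough portable sketch rather than the answer's own method. It assumes, from the expected output in the question, that only multiline paragraphs should be deduplicated and that single-line paragraphs are always kept. The function name `dedupe_multiline` and the sample text are hypothetical. Like the ORS version above, it collapses runs of blank lines into one.

```shell
# dedupe_multiline is a hypothetical helper name, not part of the original answer.
dedupe_multiline() {
  awk -v RS= -v ORS='\n\n' '
    {
      # Strip trailing spaces and tabs from every line of the paragraph so
      # that copies differing only in trailing whitespace compare equal.
      gsub(/[ \t]+\n/, "\n")
      sub(/[ \t]+$/, "")
    }
    # Always print single-line paragraphs; print a multiline one only once.
    index($0, "\n") == 0 || !seen[$0]++
  ' "$@"
}

# Hypothetical sample: a two-line paragraph (with a trailing space on its
# first line), a single-line paragraph twice, then the two-line paragraph again.
sample=$(printf 'p one \nline two\n\nstuff\n\nstuff\n\np one\nline two\n')
result=$(printf '%s\n' "$sample" | dedupe_multiline)
printf '%s\n' "$result"
```

The function reads standard input when called without arguments, or a file when given one, for example dedupe_multiline file. Because the trailing whitespace is stripped before printing, the output is normalized rather than byte-identical to the input.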