Reputation:
File:
this is a paragraph
to find in another
file
some stuff ..
more stuff ...
this is a paragraph
to find in another
file
more stuff ...
another paragraph
to match
yet more stuff..
this is a paragraph
duplicate in this
file
another paragraph
to match
this is a paragraph
duplicate in this
file
yet more stuff..
this is a paragraph
to find in another
file
should return:
this is a paragraph
to find in another
file
some stuff ..
more stuff ...
more stuff ...
another paragraph
to match
yet more stuff..
this is a paragraph
duplicate in this
file
yet more stuff..
I have found pcregrep -n -M, and I know I could loop over the paragraphs, searching for each one with sed and this command. However, pcregrep is not installed on every system, so I would like to avoid it if possible. I am looking for something elegant that uses the standard *nix tools, preferably not perl.
* Some good posts and ideas, but although they worked on the limited case I originally posted, they did not work in general. I have adjusted the example data so you can see whether a solution works more generally.
* Here is a sed one-liner that prints only the multi-line paragraphs:
sed -e '/./{H;$!d;}' -e 'x;/.*\n.*\n.*/!d' file
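As a quick check of what that one-liner does, here is a minimal sketch on made-up input. The function name `multiline_paragraphs` and the sample text are illustrative assumptions, not part of the original post. Each paragraph is accumulated in the hold space; at a blank line (or at the last line) it is swapped back into the pattern space and deleted unless it contains at least two newlines. Since H prefixes every appended line with a newline, that means the paragraph spans two or more lines.

```shell
# Hypothetical wrapper around the sed one-liner from the post.
multiline_paragraphs() {
    # First expression: collect non-blank lines in the hold space.
    # Second expression: at a paragraph boundary, swap the paragraph in
    # and delete it unless it has at least two newlines.
    sed -e '/./{H;$!d;}' -e 'x;/.*\n.*\n.*/!d' "$@"
}

# Made-up sample: a one-line paragraph, a two-line paragraph, another one-liner.
printf 'one\n\na\nb\n\ntwo\n' | multiline_paragraphs
```

Only the two-line paragraph should survive, preceded by an empty line that comes from the newline H adds in front of the first collected line.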
Upvotes: 0
Views: 967
Reputation: 80992
This mostly does what you want. The only problem (that I know of offhand) is that it collapses runs of blank lines in the input into a single blank line in the output.
awk -v RS= '!x[$0]++{print; print ""}'
This uses the fact that "If RS is set to the null string, then records are separated by blank lines." and prints an extra blank line after each record to stand in for the separator that awk swallowed.
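To see paragraph mode in action, here is a small sketch on made-up input. The function name `dedupe_paragraphs` and the sample data are assumptions for illustration only. With RS set to the empty string, each blank-line-separated paragraph becomes one record, and the `seen` array lets through only the first occurrence of each record's text.

```shell
# Hypothetical wrapper around the awk command from the answer.
dedupe_paragraphs() {
    # RS= enables paragraph mode; seen[$0]++ is 0 (false) only the first
    # time a given paragraph is read, so !seen[$0]++ selects first occurrences.
    awk -v RS= '!seen[$0]++ { print; print "" }' "$@"
}

# Made-up sample: the paragraph "a/b" appears twice, "c" and "d" once each.
printf 'a\nb\n\nc\n\na\nb\n\nd\n' | dedupe_paragraphs
```

The second copy of the two-line paragraph should be dropped, while the order of the remaining paragraphs is preserved.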
Edit: Incorporating @EdMorton's suggestions gets you this instead.
awk -v RS= -v ORS='\n\n' '!seen[$0]++'
And, for GNU awk, this version keeps the spacing between paragraphs consistent with the input (instead of collapsing runs of blank lines):
awk -v RS= '!seen[$0]++{ORS=RT; print}'
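The expected output in the question keeps repeated single-line entries ("more stuff ...", "yet more stuff..") and drops only repeated multi-line paragraphs. Assuming that is the intent, and that the paragraphs in the real file are separated by blank lines, one possible sketch is to exempt records that contain no newline from the duplicate check. The function name `dedupe_multiline_paragraphs` and the sample input are hypothetical, and this variant still collapses runs of blank lines.

```shell
# Hypothetical variant: de-duplicate only paragraphs that span several lines.
dedupe_multiline_paragraphs() {
    # index($0, "\n") == 0 is true for single-line records, which are always
    # printed; multi-line records are printed only on first occurrence.
    awk -v RS= -v ORS='\n\n' 'index($0, "\n") == 0 || !seen[$0]++' "$@"
}

# Made-up sample: "x" repeats (kept both times), "a/b" repeats (kept once).
printf 'x\n\na\nb\n\nx\n\na\nb\n\ny\n' | dedupe_multiline_paragraphs
```

Whether single-line paragraphs should really be exempt is a reading of the sample data, not something the answer states.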
Edit again:
This version seems to work correctly (with GNU awk 3.1.7 and newer; I don't know about 3.1.6), with the one exception that it adds a blank line to the end of the file.
awk -v RS= '{gsub(/[[:blank:]]+$/,""); gsub(/[[:blank:]]+\n/,"\n")} !seen[$0]++{ORS=RT;print}'
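RT is a GNU awk extension, so on systems without gawk a rough portable approximation might look like the sketch below. It is an assumption-laden variant, not the answer's command: it uses a fixed ORS instead of RT (so runs of blank lines are collapsed again), replaces `[[:blank:]]` with `[ \t]` for older awks that lack POSIX character classes, and the function name `dedupe_normalized` is made up. The two substitutions strip trailing spaces and tabs from every line, so paragraphs that differ only in trailing whitespace compare equal.

```shell
# Hypothetical portable variant: normalize trailing whitespace, then de-duplicate.
dedupe_normalized() {
    awk -v RS= -v ORS='\n\n' '
        {
            gsub(/[ \t]+\n/, "\n")   # trailing blanks on inner lines
            sub(/[ \t]+$/, "")       # trailing blanks at the end of the record
        }
        !seen[$0]++                  # print first occurrence of each paragraph
    ' "$@"
}

# Made-up sample: two copies of "a/b" that differ only in trailing whitespace.
printf 'a \nb\n\na\nb\t\n\nc\n' | dedupe_normalized
```

Note that the paragraphs are printed in their normalized form, without the trailing whitespace.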
Upvotes: 3