Reputation: 10882
I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE
without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.
Upvotes: 4
Views: 272
Reputation: 17381
For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short _[$0]
is a counter (of appearance) for each unique line, for any line ($0
) appearing for the second time _[$0] >= 1
, thus !_[$0]
evaluates to false
, causing it not to be printed except its first time appearance.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)
Upvotes: 0
Reputation: 189417
I'll note that this script does not appear to work with my copy of sed
(GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C
it works fine.
Here's an annotated version of the script. sed
basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed;
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d
).
You can get it to work independently of locale by changing [ -~]
to [[:print:]]
Upvotes: 5
Reputation: 60868
The code doesn't work for me, perhaps due to some locale setting, but this does:
vvv
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
^^^
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while ($my pattern = <>) {
chomp $pattern;
$pattern = "$pattern\n$hold"; # G
$pattern =~ s/(\n)/$1$1/; # s/\n/&&/
if ($pattern =~ /^([^\n]*\n).*\n\1/) { # /…/
next; # d
}
$pattern =~ s/\n//; # s/\n//
$hold = $pattern; # h
$pattern =~ /^([^\n]*\n?)/; print $1; # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G
: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceeded it.s/\n/&&/
: Turn the newline which separates them into a double newline, so that we can match subsequent and non-subsequent duplicates the same, see the next step.^\([^\n]*\n\).*\n\1/
: Look through the current text for the following: at the beginning of all the lines (^
) look for a first line including trailing newline (\([^\n]*\n\)
), then anything (.*
), then a newline (\n
), and then that same first line including newline repeated again (\1
). If two subsequent lines are the same, then the .*
in the regular expression will match the empty string, but the two \n
will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.d
: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.s/\n//
: Otherwise, we continue and next turn the double newline back into a single newline.h
: We include the current line in our list of all unique lines.P
: And finally print this new unique line, up to the newline character.Upvotes: 3