Reputation: 67299

How to delete duplicate lines in a file without sorting it in Unix

Is there a way to delete duplicate lines in a file in Unix?

I can do it with sort -u and uniq commands, but I want to use sed or awk.

Is that possible?

Upvotes: 212

Answers (9)

BobDodds

Reputation: 23

uniq would be fooled by trailing spaces and tabs. To emulate how a human compares lines, I trim all trailing spaces and tabs before comparing.

I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.

I have Bash 5.0 and sed 4.7 in Ubuntu 20.10 (Groovy Gorilla). The second one-liner did not work for me; it failed at the character set match.

There are three variations. The first eliminates adjacent repeated lines, the second eliminates repeated lines wherever they occur, and the third eliminates all but the last instance of each line in the file.

# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.

dedupe() {
 sed -E '
  # trim the line that starts each cycle, so line 1 is trimmed too
  s/[ \t]+$//;
  $!{
   N;
   s/[ \t]+$//;
   /^(.*)\n\1$/!P;
   D;
  }
 ';
}

# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons.
# Squeeze blank lines to one.

norepeat() {
 sed -n -E '
  s/[ \t]+$//;
  G;
  /^(\n){2,}/d;
  # compare whole lines only, so "abc" is not dropped after an earlier "ab"
  /^([^\n]+)\n(.*\n)?\1(\n|$)/d;
  h;
  P;
  ';
}

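# Keep only the last instance of each repeated line. Trailing spaces
# and tabs are trimmed before comparison. Blank lines are squeezed to one.

lastrepeat() {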
 sed -n -E '
  s/[ \t]+$//;
  /^$/{
   H;
   d;
  };
  G;
  # delete previous repeated line if found
  s/^([^\n]+)(\n.*)?(\n\1(\n.*|$))/\1\2\4/;
  # after searching for previous repeat, move tested last line to end
  s/^([^\n]+)(\n)(.*)/\3\2\1/;
  $!{
   h;
   d;
  };
  # squeeze blank lines to one
  s/(\n){3,}/\n\n/g;
  s/^\n//;
  p;
 ';
}
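
For comparison, the "keep only the last instance" idea can also be sketched in awk with two passes over the file. This is an alternative technique, not a translation of lastrepeat(): the name keep_last is made up for illustration, it needs a real file rather than a pipe, and it does no whitespace trimming or blank-line squeezing.

```shell
# Hypothetical helper: print each distinct line only at its last occurrence.
# Usage: keep_last file.txt
keep_last() {
  # Pass 1 (NR == FNR): remember the last line number of every distinct line.
  # Pass 2: print a line only when it sits at that remembered position.
  awk 'NR == FNR { last[$0] = FNR; next } last[$0] == FNR' "$1" "$1"
}
```

Reading the file twice keeps the original order without holding the whole file in one buffer, which is the trade-off against the hold-space approach above.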

Upvotes: 2

Aashutosh Kumar

Reputation: 831

This can be achieved using AWK.

The line below displays the file with consecutive duplicate lines removed (uniq compares only adjacent lines):

awk '{ print }' file_name | uniq

You can write this output to a new file:

awk '{ print }' file_name | uniq > uniq_file_name

The new file uniq_file_name will contain no consecutive duplicates; repeated lines that are not adjacent will remain.
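
One caveat worth checking before relying on this: uniq compares only neighbouring lines. A small sketch with invented sample data (the helper name sample is hypothetical):

```shell
# Invented sample: "a" repeats both adjacently and again later on.
sample() { printf 'a\na\nb\na\n'; }

# uniq collapses the adjacent pair, but the later "a" survives.
sample | uniq
# prints:
# a
# b
# a
```

Without sorting first, a later repeat is only removed by something that remembers every line seen so far.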

Upvotes: -4

Weike

Reputation: 1270

The first solution is also from http://sed.sourceforge.net/sed1line.txt

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5

The core idea is:

Print each run of duplicate consecutive lines only once, at its last appearance, and use the D command to implement the loop.

Explanation:

$!N;: if the current line is not the last line, use the N command to append the next line to the pattern space.
/^(.*)\n\1$/!P: if the pattern space consists of two identical strings separated by \n, the next line is the same as the current line, so by the core idea we do not print it. Otherwise, the current line is the last of its run of duplicate consecutive lines, and the P command prints the pattern space up to and including the first \n.
D: the D command deletes the pattern space up to and including the first \n, so the pattern space now holds the next line. D then makes sed jump back to its first command, $!N, without reading a new line from the file or standard input stream.

The second solution is easy to understand (from myself):

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5

The core idea is:

Print each run of duplicate consecutive lines only once, at its first appearance, and use the : command and the t command to implement the loop.

Explanation:

Read a new line from the input stream or file and print it once with p.
Use the :loop command to set a label named loop.
Use $!N to append the next line to the pattern space, unless the current line is the last one.
Use s/^(.*)\n\1$/\1/ to delete the current line if the next line is the same as the current line. Here the s command performs the deletion.
If the s command made a substitution, the tloop command makes sed jump back to the label loop, repeating the same steps for the following lines until no more consecutive duplicates of the most recently printed line remain. Otherwise, the D command deletes the most recently printed line from the pattern space and makes sed jump to the first command, p. The pattern space then holds the next new line.
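
To try both loops side by side, they can be wrapped as shell functions. This is a sketch only: the function names are made up, the scripts are spread over several lines instead of being joined with semicolons, and -E is assumed in place of -r (GNU sed accepts either spelling).

```shell
# Print each run of consecutive duplicates at its LAST appearance (D-based loop).
print_last_of_run() {
  sed -nE '
    $!N
    /^(.*)\n\1$/!P
    D
  '
}

# Print each run of consecutive duplicates at its FIRST appearance (label + t loop).
print_first_of_run() {
  sed -nE '
    p
    :loop
    $!N
    s/^(.*)\n\1$/\1/
    tloop
    D
  '
}

printf '1\n2\n2\n3\n3\n3\n' | print_first_of_run
```

On plain input the two produce the same output. They differ only in which copy of a run gets printed, which matters once the script also modifies the lines.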

Upvotes: 6

Chris Koknat

Reputation: 3451

Perl one-liner similar to jonas's AWK solution:

perl -ne 'print if ! $x{$_}++' file

This variation removes trailing white space before comparing:

perl -lne 's/\s*$//; print if ! $x{$_}++' file

This variation edits the file in-place:

perl -i -ne 'print if ! $x{$_}++' file

This variation edits the file in-place, and makes a backup file.bak:

perl -i.bak -ne 'print if ! $x{$_}++' file
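
For readers who prefer to stay with awk, the trailing-white-space variation could be sketched as below. The function name dedupe_trimmed is hypothetical, and only trailing spaces and tabs are stripped, which is narrower than Perl's \s.

```shell
# Hypothetical helper: strip trailing spaces/tabs, then keep first occurrences.
dedupe_trimmed() {
  awk '{ sub(/[ \t]+$/, "") } !seen[$0]++' "$@"
}

printf 'a \na\nb\t\nb\n' | dedupe_trimmed
# prints:
# a
# b
```

As with the -l Perl variant, the lines are printed in their trimmed form.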

Upvotes: 25

Sadhun

Reputation: 264

Use:

cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'

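This sorts the input, counts each distinct line with uniq -c, and uses AWK to print the first field of every line whose count is 1. Note that it drops every copy of a duplicated line rather than keeping one, and it does not preserve the original line order.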

Upvotes: -4

Bohr

Reputation: 1736

An alternative way using Vim (Vi compatible):

Delete duplicate, consecutive lines from a file:

vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq

Delete duplicate, nonconsecutive and nonempty lines from a file:

vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq

Upvotes: 8

Bradley Kreider

Reputation: 1155

The one-liner that Andre Miller posted works, except with recent versions of sed when the input file ends with a blank line containing no characters. On my Mac, the CPU just spins.

This is an infinite loop if the last line is blank and doesn't contain any characters:

sed '$!N; /^\(.*\)\n\1$/!P; D'

It doesn't hang, but you lose the last line:

sed '$d;N; /^\(.*\)\n\1$/!P; D'

The explanation is at the very end of the sed FAQ:

The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".

Upvotes: 6

Jonas Elfström

Reputation: 31468

awk '!seen[$0]++' file.txt

seen is an associative array keyed by each line of the file. If a line isn't in the array yet, then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. AWK will print the lines where the expression evaluates to true.

The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2, and so on. AWK evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
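
To make the mechanics concrete, the one-liner can be spelled out as an explicit program. This is only a sketch of the same logic; the function name dedupe_keep_first is made up for illustration.

```shell
# Hypothetical helper: same behaviour as awk '!seen[$0]++', written out.
dedupe_keep_first() {
  awk '{
    if (seen[$0] == 0) {   # an entry that was never set compares equal to 0
      print                # first time this exact line appears
    }
    seen[$0]++             # count every occurrence of the line
  }' "$@"
}

printf 'b\na\nb\nc\na\n' | dedupe_keep_first
# prints:
# b
# a
# c
```

The original order is preserved because each line is printed the moment it is first seen. The cost is that every distinct line is kept in memory.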

Upvotes: 402

Andre Miller

Reputation: 15533

From http://sed.sourceforge.net/sed1line.txt: (Please don't ask me how this works ;-) )

 # delete duplicate, consecutive lines from a file (emulates "uniq").
 # First line in a set of duplicate lines is kept, rest are deleted.
 sed '$!N; /^\(.*\)\n\1$/!P; D'

 # delete duplicate, nonconsecutive lines from a file. Beware not to
 # overflow the buffer size of the hold space, or else use GNU sed.
 sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'

Upvotes: 37
