Reputation: 141
Given a long text file like this one (that we will call file.txt):
EDITED
1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA
How do I delete the lines that appear at least twice in the same file, in bash? What I mean is that I want this result:
1 AA
2 ab
3 azd
6 aslmdkfj
I do not want the same line to appear twice in the file. Could you show me the command, please?
Upvotes: 2
Views: 1641
Reputation: 212198
Assuming whitespace is significant, the typical solution is:
awk '!x[$0]++' file.txt
(e.g., the line "ab " is not considered the same as "ab". It is probably simplest to pre-process the data if you want to treat whitespace differently.)
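For instance, if trailing whitespace should be ignored, one possible pre-processing step (a sketch; adjust the pattern to whatever whitespace rule you actually want) is to strip it before deduplicating:
sed 's/[[:space:]]*$//' file.txt | awk '!x[$0]++'
Here the sed step removes trailing spaces and tabs, so "ab " and "ab" become the same key.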
--EDIT-- Given the modified question, which I'll interpret as only wanting to check uniqueness after a given column, try something like:
awk '!x[ substr( $0, 2 )]++' file.txt
This compares only the substring starting at the 2nd character of each line, ignoring the first character (here, the single-digit line number). This is a typical awk idiom: we are simply building an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) which holds the number of times a given string has been seen. Because of the post-increment, the expression is true only the first time a string is seen, and awk's default action for a true pattern is to print the line. In the first case, the key is the entire input line contained in $0. In the second case, the key is the substring consisting of everything from the 2nd character onward.
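Spelled out longhand, the first one-liner is equivalent to this (same logic, with an explicit action):
awk '{ if (x[$0] == 0) print; x[$0]++ }' file.txt
The pattern-only form simply folds the zero-check and the increment into the single expression !x[$0]++.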
Upvotes: 9
Reputation: 5791
Try this simple pipeline:
cat file.txt | sort | uniq
cat will output the contents of the file, sort will put duplicate entries adjacent to each other, and uniq will remove adjacent duplicate entries.
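As a side note, sort can do the de-duplication itself via its -u flag, so the pipeline can be shortened to:
sort -u file.txt
Either way, note that the output comes out sorted, not in the original line order.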
Hope this helps!
Upvotes: 8
Reputation: 26753
The uniq command will do what you want.
But make sure the file is sorted first, since uniq only checks consecutive lines.
Like this:
sort file.txt | uniq
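To see why the sort matters, try uniq on its own with non-adjacent duplicates:
printf 'AA\nab\nAA\n' | uniq
This prints all three lines, because the two AA lines are not next to each other.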
Upvotes: 4