Reputation: 141
Given a long text file like this one (that we will call file.txt):
EDITED
1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA
How do I delete the lines that appear at least twice in the same file, in bash? What I mean is that I want this result:
1 AA
2 ab
3 azd
6 aslmdkfj
I do not want the same line to appear twice in the file. Could you show me the command, please?
Upvotes: 2
Views: 1641
Reputation: 212198
Assuming whitespace is significant, the typical solution is:
awk '!x[$0]++' file.txt
(e.g., the line "ab " is not considered the same as "ab". It is probably simplest to pre-process the data if you want to treat whitespace differently.)
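For instance, if trailing whitespace should be ignored, one possible pre-processing step (a sketch; adjust the pattern to whatever whitespace rule you actually want) is to strip it before deduplicating:
sed 's/[[:space:]]*$//' file.txt | awk '!x[$0]++'
Here the sed step removes trailing spaces and tabs, so "ab " and "ab" become the same key.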
--EDIT-- Given the modified question, which I'll interpret as only wanting to check uniqueness after a given column, try something like:
awk '!x[ substr( $0, 2 )]++' file.txt
This compares only the substring starting at the 2nd character of each line, ignoring the first character (here, the single-digit line number). This is a typical awk idiom: we are simply building an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) which holds the number of times a given string has been seen. Because of the post-increment, the expression is true only the first time a string is seen, and awk's default action for a true pattern is to print the line. In the first case, the key is the entire input line contained in $0. In the second case, the key is the substring consisting of everything from the 2nd character onward.
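Spelled out longhand, the first one-liner is equivalent to this (same logic, with an explicit action):
awk '{ if (x[$0] == 0) print; x[$0]++ }' file.txt
The pattern-only form simply folds the zero-check and the increment into the single expression !x[$0]++.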
Upvotes: 9
Reputation: 5791
Try this simple pipeline:
cat file.txt | sort | uniq
cat will output the contents of the file, sort will put duplicate entries adjacent to each other, and uniq will remove adjacent duplicate entries.
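As a side note, sort can do the de-duplication itself via its -u flag, so the pipeline can be shortened to:
sort -u file.txt
Either way, note that the output comes out sorted, not in the original line order.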
Hope this helps!
Upvotes: 8
Reputation: 26753
The uniq command will do what you want.
But make sure the file is sorted first, since uniq only checks consecutive lines.
Like this:
sort file.txt | uniq
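To see why the sort matters, try uniq on its own with non-adjacent duplicates:
printf 'AA\nab\nAA\n' | uniq
This prints all three lines, because the two AA lines are not next to each other.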
Upvotes: 4