user1140126

Reputation: 2649

How to keep only those rows which are unique in a tab-delimited file in Unix

Here, two rows are considered redundant if the second value is the same. Is there any Unix/Linux command that can achieve the following?

1   aa
2   aa
1   ss
3   dd
4   dd

Result

1   aa
1   ss
3   dd

I generally use the following command, but it does not achieve what I want here.

sort -k2 /Users/fahim/Desktop/delnow2.csv | uniq
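For illustration, on the five sample rows above this pipeline keeps every line, because uniq only removes adjacent duplicates of the whole line and no two whole lines here are identical:

sort -k2 /Users/fahim/Desktop/delnow2.csv | uniq
1   aa
2   aa
3   dd
4   dd
1   ss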

Edit:

My file had roughly 25 million lines. Time when using the solution suggested by @Steve: 33 seconds.

$date; awk -F '\t' '!a[$2]++' myfile.txt  > outfile.txt; date
Wed Nov 27 18:00:16 EST 2013
Wed Nov 27 18:00:49 EST 2013

The sort and uniq approach was taking too much time; I quit after waiting for 5 minutes.

Upvotes: 2

Views: 426

Answers (2)

Michael Kruglos

Reputation: 1286

I understand that you want the file sorted and deduplicated by the second field. You need to add -u to sort to achieve this.

sort -u -k2 /Users/fahim/Desktop/delnow2.csv
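Note that -k2 makes the sort key run from the second field to the end of the line. If the real file can have more than two fields, a safer variant (a sketch, assuming a bash shell for the $'\t' tab literal) restricts the key to the second field only:

sort -t $'\t' -u -k2,2 /Users/fahim/Desktop/delnow2.csv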

Upvotes: 1

Steve

Reputation: 54402

Perhaps this is what you're looking for:

awk -F "\t" '!a[$2]++' file

Results:

1   aa
1   ss
3   dd
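For readers unfamiliar with the idiom: !a[$2]++ prints a line only the first time its second field is seen, because the array element is zero on first access and incremented afterwards. An equivalent, more verbose form (the array name seen is purely illustrative):

awk -F '\t' '{ if (!seen[$2]) { print; seen[$2] = 1 } }' file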

Upvotes: 5
