Ale

Reputation: 41

Deleting duplicates but leaving one according to sorting in a different column

I have a huge table with hundreds of thousands of rows and ten columns. I'd like to remove those duplicates according to column 4 (ID) but at the same time leaving the one that has the highest value on column 9 (score).

I tried this simple awk one-liner, but it's apparently removing more lines than it should, and it isn't even catching all the duplicates:

sort -rnk 9 myfile.txt | awk -F"[. ]" '!a[$4]++' >filtered.txt

I'd appreciate any advice; I'd love to get this working in Python or Perl, but I'm not very good at them (yet).

Thanks

Upvotes: 1

Views: 86

Answers (2)

peak

Reputation: 116650

Since the input file is so large, it makes sense to avoid requiring awk to store more information than necessary. So, assuming sort is not prohibitively slow, you may wish to consider the following variant of the sort/awk approach, which only keeps one id in memory at a time. It does assume the id is never the empty string:

sort -t $'\t' -k4,4 -k9,9rn myfile.txt | awk -F'\t' 'key != $4 {key = $4; print}'
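To see the idea concretely, here is the same pipeline run on a tiny made-up three-row, ten-column sample (the file name and values are hypothetical; `"$(printf '\t')"` is a portable stand-in for the bash-only `$'\t'`). Putting the `r` and `n` modifiers directly on the score key keeps the id key sorted as plain text:

```shell
# Three tab-separated rows with 10 columns; id is column 4, score is column 9.
# Two rows share id "X" (scores 5 and 9); one row has id "Y" (score 2).
printf 'a1\tb\tc\tX\te\tf\tg\th\t5\tj\na2\tb\tc\tX\te\tf\tg\th\t9\tj\na3\tb\tc\tY\te\tf\tg\th\t2\tj\n' > sample.txt

TAB="$(printf '\t')"

# Sort by id, then by score descending (numeric within each id), and let
# awk keep only the first row it sees for each id -- the highest-scoring one.
sort -t "$TAB" -k4,4 -k9,9rn sample.txt | awk -F'\t' 'key != $4 {key = $4; print}'
# Prints the a2 row (id X, score 9) and the a3 row (id Y, score 2).
```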

Upvotes: 1

glenn jackman

Reputation: 246744

Your code looks OK; it's certainly a clever way to do it. I would write

awk -F"\t" '
    !($4 in max) || $9 > max[$4] {max[$4] = $9; line[$4] = $0} 
    END {for (key in line) print line[key]}
' myfile.txt

That approach should be pretty easy to translate to another language.
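For example, here is a rough Python translation of that awk program, since the question mentions wanting one. The function name is mine, and it assumes tab-separated rows; note that Python indexes are 0-based, so the awk fields `$4` and `$9` become indexes 3 and 8:

```python
import csv

def dedupe_by_max_score(rows, id_col=3, score_col=8):
    """Keep one row per id (awk's $4): the row with the highest
    numeric score (awk's $9), mirroring the max[]/line[] arrays."""
    best = {}
    for row in rows:
        key = row[id_col]
        # Replace the stored row whenever a higher score appears for this id.
        if key not in best or float(row[score_col]) > float(best[key][score_col]):
            best[key] = row
    return list(best.values())

# Typical use on a tab-separated file (file name assumed):
# with open("myfile.txt", newline="") as f:
#     kept = dedupe_by_max_score(csv.reader(f, delimiter="\t"))
```

Like the awk version, this holds one row per distinct id in memory, so it needs no pre-sorting.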

Upvotes: 2
