Ale

Reputation: 41

Deleting duplicates but leaving one according to sorting in a different column

I have a huge table with hundreds of thousands of rows and ten columns. I'd like to remove those duplicates according to column 4 (ID) but at the same time leaving the one that has the highest value on column 9 (score).

I tried this simple awk one-liner, but it's apparently removing more lines than it should, and it isn't even catching all the duplicates:

sort -rnk 9 myfile.txt | awk -F"[. ]" '!a[$4]++' >filtered.txt

I'd appreciate any advice; I'd love to get this working in Python or Perl, but I'm not very good at them (yet).

Thanks

Upvotes: 1

Views: 86

Answers (2)

peak

Reputation: 116650

Since the input file is so large, it makes sense to avoid requiring awk to store more information than necessary. So, assuming sort is not prohibitively slow, you may wish to consider the following variant of the sort/awk approach, which only keeps one id in memory at a time. It does assume the id is never the empty string:

sort -t $'\t' -k4,4 -k9,9rn myfile.txt | awk -F'\t' 'key != $4 {key = $4; print}'
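To see the idea concretely, here is the same pipeline run on a tiny made-up three-row, ten-column sample (the file name and values are hypothetical; `"$(printf '\t')"` is a portable stand-in for the bash-only `$'\t'`). Putting the `r` and `n` modifiers directly on the score key keeps the id key sorted as plain text:

```shell
# Three tab-separated rows with 10 columns; id is column 4, score is column 9.
# Two rows share id "X" (scores 5 and 9); one row has id "Y" (score 2).
printf 'a1\tb\tc\tX\te\tf\tg\th\t5\tj\na2\tb\tc\tX\te\tf\tg\th\t9\tj\na3\tb\tc\tY\te\tf\tg\th\t2\tj\n' > sample.txt

TAB="$(printf '\t')"

# Sort by id, then by score descending (numeric within each id), and let
# awk keep only the first row it sees for each id -- the highest-scoring one.
sort -t "$TAB" -k4,4 -k9,9rn sample.txt | awk -F'\t' 'key != $4 {key = $4; print}'
# Prints the a2 row (id X, score 9) and the a3 row (id Y, score 2).
```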

Upvotes: 1

glenn jackman

Reputation: 246744

Your code looks OK; it's certainly a clever way to do it. I would write

awk -F"\t" '
    !($4 in max) || $9 > max[$4] {max[$4] = $9; line[$4] = $0} 
    END {for (key in line) print line[key]}
' myfile.txt

That approach should be pretty easy to translate to another language.
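For example, here is a rough Python translation of that awk program, since the question mentions wanting one. The function name is mine, and it assumes tab-separated rows; note that Python indexes are 0-based, so the awk fields `$4` and `$9` become indexes 3 and 8:

```python
import csv

def dedupe_by_max_score(rows, id_col=3, score_col=8):
    """Keep one row per id (awk's $4): the row with the highest
    numeric score (awk's $9), mirroring the max[]/line[] arrays."""
    best = {}
    for row in rows:
        key = row[id_col]
        # Replace the stored row whenever a higher score appears for this id.
        if key not in best or float(row[score_col]) > float(best[key][score_col]):
            best[key] = row
    return list(best.values())

# Typical use on a tab-separated file (file name assumed):
# with open("myfile.txt", newline="") as f:
#     kept = dedupe_by_max_score(csv.reader(f, delimiter="\t"))
```

Like the awk version, this holds one row per distinct id in memory, so it needs no pre-sorting.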

Upvotes: 2
