Reputation: 1080
I'm looking for an easy solution to this problem, in a one-liner if possible. I think I could manage to do it in Python, but I'm sure there's a simpler way. I'm working with hundreds of records, but this could scale up to millions.
I have a list of rows with several fields. The ID (first column) is repeated several times, and some of the other columns can vary. For each ID I want to keep only one row, specifically the one with the highest value in a particular column, while keeping the rest of the fields of that row. E.g.:
ID1 55 blue 4.5
ID1 57 red 4.3
ID2 65 orange 3.1
ID2 66 orange 3.2
ID2 90 orange 3.1
ID3 60 blue 4.5
Filtering this data on the second column, I would like to get:
ID1 57 red 4.3
ID2 90 orange 3.1
ID3 60 blue 4.5
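Since the question mentions Python, here is a minimal sketch of the intended filtering (the inlined sample data and whitespace splitting are assumptions for illustration):

```python
# Keep, for each ID, the whole row with the highest value in the second column.
rows = [
    "ID1 55 blue 4.5",
    "ID1 57 red 4.3",
    "ID2 65 orange 3.1",
    "ID2 66 orange 3.2",
    "ID2 90 orange 3.1",
    "ID3 60 blue 4.5",
]

best = {}  # ID -> (value, full row)
for row in rows:
    fields = row.split()
    rec_id, value = fields[0], float(fields[1])
    # Record the row if this ID is new or beats the current maximum.
    if rec_id not in best or value > best[rec_id][0]:
        best[rec_id] = (value, row)

for rec_id in sorted(best):
    print(best[rec_id][1])
```

To read from a file instead of a list, iterate over `open("file")` and strip newlines.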
Upvotes: 1
Views: 128
Reputation: 16997
GNU sort: the first sort orders each ID's rows highest-value-first, then the second sort with -u keeps only the first line seen for each ID.
$ cat file
ID1 55 blue 4.5
ID1 57 red 4.3
ID2 65 orange 3.1
ID2 66 orange 3.2
ID2 90 orange 3.1
ID3 60 blue 4.5
$ sort -k 1,1 -k 2,2nr file | sort -u -t' ' -k1,1
ID1 57 red 4.3
ID2 90 orange 3.1
ID3 60 blue 4.5
Upvotes: 2
Reputation: 246847
awk '
$2 > max[$1] {max[$1] = $2; line[$1] = $0}
END {for (id in line) print line[id]}
' file
Or, sort first by ID, then by value descending, and only print the line when a new ID is seen:
sort -k 1,1 -k 2,2nr file | awk '!seen[$1]++'
This method will ensure the output is sorted by ID.
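The same sort-then-first-seen idea can be sketched in Python for comparison (the inlined sample data is an assumption for illustration):

```python
rows = [
    "ID1 55 blue 4.5",
    "ID1 57 red 4.3",
    "ID2 65 orange 3.1",
    "ID2 66 orange 3.2",
    "ID2 90 orange 3.1",
    "ID3 60 blue 4.5",
]

# Sort by ID ascending, then by the numeric second field descending.
rows.sort(key=lambda r: (r.split()[0], -float(r.split()[1])))

# Keep only the first row seen for each ID, mimicking awk '!seen[$1]++'.
seen = set()
result = []
for row in rows:
    rec_id = row.split()[0]
    if rec_id not in seen:
        seen.add(rec_id)
        result.append(row)

print("\n".join(result))
```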
Upvotes: 1
Reputation: 23364
Assuming the input file is sorted on the ID field, this will return records in order:
awk '$1 != prevID && NR > 1{print content[prevID]};
$2 > max[$1] {max[$1]=$2; content[$1]=$0}; {prevID=$1};
END{print content[prevID]}' file
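For comparison, a rough one-pass Python equivalent for ID-sorted input (the inlined sample data is an assumption); it tracks only the current group's best row, which matters if the data scales to millions of rows:

```python
rows = [
    "ID1 55 blue 4.5",
    "ID1 57 red 4.3",
    "ID2 65 orange 3.1",
    "ID2 66 orange 3.2",
    "ID2 90 orange 3.1",
    "ID3 60 blue 4.5",
]

# One pass over ID-sorted rows: emit the best row of a group when the
# ID changes, and emit the final group's best row at the end.
best_row, best_val, prev_id = None, None, None
out = []
for row in rows:
    fields = row.split()
    rec_id, value = fields[0], float(fields[1])
    if rec_id != prev_id:
        if best_row is not None:
            out.append(best_row)
        best_row, best_val, prev_id = row, value, rec_id
    elif value > best_val:
        best_row, best_val = row, value
if best_row is not None:
    out.append(best_row)

print("\n".join(out))
```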
Upvotes: 1
Reputation: 77105
You can try something like this:
awk '($1 in a){if ($2 > a[$1]) {a[$1] = $2; b[$1] = $0}; next}
{a[$1] = $2; b[$1] = $0}
END{for(x in b) print b[x]}' file
$ awk '($1 in a){if ($2 > a[$1]) {a[$1] = $2; b[$1] = $0}; next}
{a[$1] = $2; b[$1] = $0}
END{for(x in b) print b[x]}' file
ID1 57 red 4.3
ID2 90 orange 3.1
ID3 60 blue 4.5
Note: The output may be out of order
Upvotes: 0