Reputation: 1080
I'm looking for an easy solution to this problem, in a one-liner if possible. I think I could manage to do it in Python, but I'm sure there's a simpler way. I'm working with hundreds of records, but this could scale up to millions.
I have a list of rows with several fields. The ID (first column) is repeated several times, and some of the other columns can vary. For each ID I want to keep only one row, specifically the one with the highest value in a particular column, while keeping the rest of the fields of that row. E.g.:
ID1 55 blue 4.5
ID1 57 red 4.3
ID2 65 orange 3.1
ID2 66 orange 3.2
ID2 90 orange 3.1
ID3 60 blue 4.5
Filtering this data on the second column, I would like to get:
ID1 57 red 4.3
ID2 90 orange 3.1
ID3 60 blue 4.5
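Since the question mentions Python, here is a minimal sketch of the intended filtering (the inlined sample data and whitespace splitting are assumptions for illustration):

```python
# Keep, for each ID, the whole row with the highest value in the second column.
rows = [
    "ID1 55 blue 4.5",
    "ID1 57 red 4.3",
    "ID2 65 orange 3.1",
    "ID2 66 orange 3.2",
    "ID2 90 orange 3.1",
    "ID3 60 blue 4.5",
]

best = {}  # ID -> (value, full row)
for row in rows:
    fields = row.split()
    rec_id, value = fields[0], float(fields[1])
    # Record the row if this ID is new or beats the current maximum.
    if rec_id not in best or value > best[rec_id][0]:
        best[rec_id] = (value, row)

for rec_id in sorted(best):
    print(best[rec_id][1])
```

To read from a file instead of a list, iterate over `open("file")` and strip newlines.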
Upvotes: 1
Views: 128
Reputation: 16997
GNU sort: the first sort orders each ID's rows highest-value-first, then the second sort with -u keeps only the first line seen for each ID.
$ cat file
ID1 55 blue 4.5
ID1 57 red 4.3
ID2 65 orange 3.1
ID2 66 orange 3.2
ID2 90 orange 3.1
ID3 60 blue 4.5
$ sort -k 1,1 -k 2,2nr file | sort -u -t' ' -k1,1
ID1 57 red 4.3
ID2 90 orange 3.1
ID3 60 blue 4.5
Upvotes: 2
Reputation: 246847
awk '
$2 > max[$1] {max[$1] = $2; line[$1] = $0}
END {for (id in line) print line[id]}
' file
Or, sort first by ID, then by value descending, and only print the line when a new ID is seen:
sort -k 1,1 -k 2,2nr file | awk '!seen[$1]++'
This method will ensure the output is sorted by ID.
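The same sort-then-first-seen idea can be sketched in Python for comparison (the inlined sample data is an assumption for illustration):

```python
rows = [
    "ID1 55 blue 4.5",
    "ID1 57 red 4.3",
    "ID2 65 orange 3.1",
    "ID2 66 orange 3.2",
    "ID2 90 orange 3.1",
    "ID3 60 blue 4.5",
]

# Sort by ID ascending, then by the numeric second field descending.
rows.sort(key=lambda r: (r.split()[0], -float(r.split()[1])))

# Keep only the first row seen for each ID, mimicking awk '!seen[$1]++'.
seen = set()
result = []
for row in rows:
    rec_id = row.split()[0]
    if rec_id not in seen:
        seen.add(rec_id)
        result.append(row)

print("\n".join(result))
```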
Upvotes: 1
Reputation: 23364
Assuming the input file is sorted on the ID field, this will return records in order:
awk '$1 != prevID && NR > 1{print content[prevID]};
$2 > max[$1] {max[$1]=$2; content[$1]=$0}; {prevID=$1};
END{print content[prevID]}' file
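For comparison, a rough one-pass Python equivalent for ID-sorted input (the inlined sample data is an assumption); it tracks only the current group's best row, which matters if the data scales to millions of rows:

```python
rows = [
    "ID1 55 blue 4.5",
    "ID1 57 red 4.3",
    "ID2 65 orange 3.1",
    "ID2 66 orange 3.2",
    "ID2 90 orange 3.1",
    "ID3 60 blue 4.5",
]

# One pass over ID-sorted rows: emit the best row of a group when the
# ID changes, and emit the final group's best row at the end.
best_row, best_val, prev_id = None, None, None
out = []
for row in rows:
    fields = row.split()
    rec_id, value = fields[0], float(fields[1])
    if rec_id != prev_id:
        if best_row is not None:
            out.append(best_row)
        best_row, best_val, prev_id = row, value, rec_id
    elif value > best_val:
        best_row, best_val = row, value
if best_row is not None:
    out.append(best_row)

print("\n".join(out))
```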
Upvotes: 1
Reputation: 77105
You can try something like this:
awk '($1 in a){if ($2 > a[$1]) {a[$1] = $2; b[$1] = $0}; next}
{a[$1] = $2; b[$1] = $0}
END{for(x in b) print b[x]}' file
$ awk '($1 in a){if ($2 > a[$1]) {a[$1] = $2; b[$1] = $0}; next}
{a[$1] = $2; b[$1] = $0}
END{for(x in b) print b[x]}' file
ID1 57 red 4.3
ID2 90 orange 3.1
ID3 60 blue 4.5
Note: The output may be out of order
Upvotes: 0