Reputation: 315
My question is very similar to this previously asked question:
Output whole line once for each unique value of a column (Bash)
but with one major difference. In that question's example:
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> AIQLTGK 8 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
pep> VSSILEDKILSR 2 genes ADUm.2146,ADUm.5750
The goal was to "print a line for each distinct value of the peptides in column 2, meaning the above input would become:"
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 1 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
What I would like instead is to print one line for each unique entry in column 2, choosing the line with the highest value in column 3, so the output would look like this:
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> KHEPPTEVDIEGR 5 genes ADUm.367
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
Thanks in advance.
Upvotes: 0
Views: 69
Reputation: 77185
Here is one way of doing it:
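awk '
# Key already seen: replace the stored line only if column 3 is larger.
($2 in seen) {
    if ($3+0 > seen[$2]+0) {
        seen[$2] = $3    # update the running maximum, not just the first value
        line[$2] = $0
    }
    next
}
# First occurrence of this key: remember its value and its line.
{
    seen[$2] = $3
    line[$2] = $0
}
END {
    for (x in line) print line[x]
}' file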
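Output (the line order is not guaranteed, because awk's for-in loop visits array keys in an unspecified order):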
pep> AIQLTGK 10 genes ADUm.1999,ADUm.3560
pep> AEYTCVAETK 2 genes ADUm.1024,ADUm.5198,ADUm.750
pep> VSSILEDKILSR 3 genes ADUm.2146,ADUm.5750
pep> VSSILEDKTT 9 genes ADUm.1192,ADUm.2731
pep> KHEPPTEVDIEGR 5 genes ADUm.367
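If the lines should come out in the order their keys first appear, as in the desired output in the question, one possible variation is to record each key the first time it is seen and print from that list. This is only a sketch: the function name `max_per_key` and the array names `max`, `line`, and `order` are made up for illustration, and it assumes column 3 always holds a number.

```shell
# Hypothetical helper: for each column-2 key, keep the line whose column 3
# is largest, and print the kept lines in first-seen key order.
max_per_key() {
    awk '
    # New key, or a larger column-3 value for a known key: store this line.
    !($2 in max) || $3+0 > max[$2] {
        if (!($2 in max)) order[++n] = $2   # remember first-seen order of keys
        max[$2] = $3+0                      # +0 forces a numeric comparison
        line[$2] = $0
    }
    END {
        # Walk the keys in the order they first appeared in the input.
        for (i = 1; i <= n; i++) print line[order[i]]
    }' "$@"
}
```

It can be called as `max_per_key file`, or at the end of a pipeline, since awk reads standard input when no file name is given. On ties the first line seen with the maximum value is kept, because the comparison is strict.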
Upvotes: 1