Reputation: 47
I am trying to keep the rows with the highest value in column 2 for each value in column 1, including duplicate rows that share that max value, as in the desired output below.
Data is
a 55
a 66
a 130
b 88
b 99
b 99
c 110
c 130
c 130
Desired output is
a 130
b 99
b 99
c 130
c 130
I could find great answers on this site, but none exactly matching the current question.
awk '{ max=(max>$2?max:$2); arr[$2]=(arr[$2]?arr[$2] ORS:"")$0 } END{ print arr[max] }' file
yields output that includes the identical rows, but the max value is taken across all rows, not per column 1.
a 130
c 130
c 130
awk '$2>max[$1] {max[$1]=$2 ; row[$1]=$0} END{for (i in row) print row[i]}' file
The output includes the max value per column 1 but does NOT include identical rows sharing that max value.
a 130
b 99
c 130
Would you please help me trim the data in the desired way? Even the code above was obtained from questions and answers on this site. Many thanks for any help in advance!
Upvotes: 2
Views: 244
Reputation: 204638
Using any awk:
awk '
{ cnt[$1,$2]++; if ($2 > max[$1]) max[$1]=$2 }
END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
' file
a 130
b 99
b 99
c 130
c 130
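If deterministic output order matters, the same counting approach can sort the `END` loop's keys via `PROCINFO["sorted_in"]` (a sketch; this assumes GNU awk 4.0+ — on other awks the assignment is silently ignored and key order stays unspecified):

```shell
awk '
{ cnt[$1,$2]++; if ($2 > max[$1]) max[$1] = $2 }   # count ties, track per-key max
END {
  PROCINFO["sorted_in"] = "@ind_str_asc"           # gawk-only: iterate keys alphabetically
  for (key in max) {
    val = max[key]
    for (i = 1; i <= cnt[key, val]; i++) print key, val
  }
}' file
```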
Upvotes: 3
Reputation: 19271
Another way using awk. The second loop should be light, just repeating the duplicated max values.
% awk 'arr[$1] < $2{arr[$1] = $2; # get max value
co[$1]++; if(co[$1] == 1){x++; id[x] = $1}} # count unique ids
arr[$1] == $2{n[$1,arr[$1]]++} # count repeated max
END{for(i=1; i<=x; i++){
for(j=1; j<=n[id[i],arr[id[i]]]; j++){print id[i], arr[id[i]]}}}' file
a 130
b 99
b 99
c 130
c 130
or, if order doesn't matter
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++}
END{for(i in arr){
j=0; do{print i, arr[i]; j++} while(j < n[i,arr[i]])}}' file
c 130
c 130
b 99
b 99
a 130
-- EDIT --
Printing data in additional columns
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
END{for(i in arr){
j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file
c 130 data8
c 130 data9
b 99 data5
b 99 data6
a 130 data3
% cat file
a 55 data1
a 66 data2
a 130 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9
Upvotes: 2
Reputation: 104102
Here is a ruby to do that:
ruby -e '
grps=$<.read.split(/\R/).
group_by{|line| line[/^\S+/]}
# {"a"=>["a 55", "a 66", "a 130"], "b"=>["b 88", "b 99", "b 99"], "c"=>["c 110", "c 130", "c 130"]}
maxes=grps.map{|k,v| v.max_by{|s| s.split[-1].to_f}}
# ["a 130", "b 99", "c 130"]
grps.values.flatten.each{|s| puts s if maxes.include?(s)}
' file
Prints:
a 130
b 99
b 99
c 130
c 130
Upvotes: 2
Reputation: 26225
I've used this approach in the past:
awk 'NR==FNR{if($2 > max[$1]){max[$1]=$2}; next} max[$1] == $2' test.txt test.txt
a 130
b 99
b 99
c 130
c 130
This requires passing the same file twice (i.e. awk '...' test.txt test.txt), so it's not ideal, but hopefully it produces the required output with your actual data.
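For completeness, a single-pass variant along the same lines (a sketch, assuming column 2 values are positive numbers) buffers the tied max rows per key, avoiding the second read of the file:

```shell
awk '
$2 >  max[$1] { max[$1] = $2; rows[$1] = $0; next }  # new max: reset this key'\''s buffer
$2 == max[$1] { rows[$1] = rows[$1] ORS $0 }         # tie: append the row
END { for (k in rows) print rows[k] }                # group order is unspecified
' file
```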
Upvotes: 3