David

Reputation: 3105

Sorting array in shell using awk

I need to sort this file in descending order, with no duplicate names (the values for each name should be summed):

Bob 5 404
Mike 3 404
Bob 19 404
Bob 78 404
Mike 93 404
Joe 7 404

So my result should be

Bob 102
Mike 96
Joe 7

What I have now is this

awk '{if($3 == 404) arr[$1]+=$2}END{for(i in arr)print i, arr[i]}' file

I know that there is sort -d, but how do I use it from within awk?

UPDATE

awk 'BEGIN{FS=" "}{if($9 == 404) arr[$1]+=1}END{for(i in arr) print arr[i] | sort -k2nr }' input > output

I get this result

sh: 0:  not found

And my output file is now empty.

Upvotes: 1

Answers (2)

mklement0

Reputation: 437953

Reuben L.'s answer contains the right pointers, but doesn't spell out the full solutions:


The POSIX-compliant solution spelled out:

You need to pipe the output from awk to the sort utility, outside of awk:

awk '{ if($3 == 404) arr[$1]+=$2 } END{ for (i in arr) print i, arr[i] }' input |
  sort -rn -k2,2 > output

Note the specifics of the sort command:

  • -r performs reverse sorting
  • -n performs numeric sorting
  • -k2,2 sorts by the 2nd whitespace-separated field only
    • by contrast, only specifying -k2 would sort starting from the 2nd field through the remainder of the line - doesn't make a difference here, since the 2nd field is the last field, but it's an important distinction in general.
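To make the -k2 versus -k2,2 distinction concrete, here is a small sketch with made-up data (not from the question) in which the two lines tie on the 2nd field but differ in the 3rd. LC_ALL=C is set only to keep the byte ordering predictable.

```shell
# Two lines that tie on field 2 but differ in field 3 (illustrative data).
data='a x 2
b x 1'

# -k2: the key runs from field 2 to the end of the line, so field 3 breaks the tie.
open_ended=$(printf '%s\n' "$data" | LC_ALL=C sort -k2)

# -k2,2: the key is field 2 only; the tie falls back to comparing whole lines.
bounded=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2)

printf '%s\n' "-k2:" "$open_ended" "-k2,2:" "$bounded"
```

With -k2 the line b x 1 comes first, whereas with -k2,2 the line a x 2 comes first.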

Note that there's really no benefit to using the nonstandard -V option to get numeric sorting, as -n will do just fine; -V's true purpose is to perform version-number sorting.

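You could also run the sort command inside your awk script, for(i in arr)print i, arr[i] | "sort -nr -k2,2" (note the " around the sort command), but there's little benefit to doing so. The missing quotes are what break the attempt in the question's update: unquoted, awk treats sort and k2nr as variables in an expression rather than as a command string.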
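A minimal sketch of the in-awk variant, run against the sample input from the question. Storing the command in a variable (cmd, a name chosen here for illustration) lets close() refer to the same pipe, so the sorted output is flushed before awk exits.

```shell
# Sample input from the question.
input='Bob 5 404
Mike 3 404
Bob 19 404
Bob 78 404
Mike 93 404
Joe 7 404'

result=$(printf '%s\n' "$input" | awk '
  $3 == 404 { arr[$1] += $2 }     # sum field 2 per name for status 404
  END {
    cmd = "sort -rn -k2,2"        # the command must be a string
    for (i in arr) print i, arr[i] | cmd
    close(cmd)                    # flush the pipe and wait for sort to finish
  }')

printf '%s\n' "$result"
```

This should print Bob 102, Mike 96 and Joe 7, one per line, matching the result the question asks for.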


The GNU awk asort() solution spelled out:

gawk '
  { if ($3 == 404) arr[$1]+=$2 } # build array
  END{
    for (k in arr) { amap[arr[k]] = k }   # create value-to-key(!) map
    asort(arr, asorted, "@val_num_desc")  # sort values numerically, in descending order
    # print in sort order
    for (i=1; i<=length(asorted); ++i) print amap[asorted[i]], asorted[i]
  }
' input > output

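As you can see, this complicates the solution, because two extra arrays (amap and asorted) must be created:

  • for (k in arr) { amap[arr[k]] = k } creates the "inverse" of the original array in amap: it uses the values of the original array as keys and the corresponding keys as the values.
    • Caveat: if two keys have the same value (e.g., two names with the same sum), they collide in amap and only one of the names survives.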
  • asort(arr, asorted, "@val_num_desc") then sorts the original array by its values in descending, numerical order ("@val_num_desc") and stores the result in new array asorted.
    • Note that the original keys are lost in the process: asorted keys are now numerical indices reflecting the sort order.
  • for (i=1; i<=length(asorted); ++i) print amap[asorted[i]], asorted[i] then enumerates asorted by sequential numerical index, which yields the desired sort order; amap[asorted[i]] returns the matching key (e.g., Bob) from the original array for the value at hand.
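As a possible alternative to the asort() approach, GNU awk 4.0 and later can control the traversal order of for (k in arr) through PROCINFO["sorted_in"]. The sketch below assumes gawk is installed (it is not POSIX awk) and uses the sample input from the question; it needs no inverse map, so names whose sums happen to be equal are all kept.

```shell
# Sample input from the question.
input='Bob 5 404
Mike 3 404
Bob 19 404
Bob 78 404
Mike 93 404
Joe 7 404'

if command -v gawk >/dev/null 2>&1; then
  result=$(printf '%s\n' "$input" | gawk '
    $3 == 404 { arr[$1] += $2 }                # sum field 2 per name
    END {
      # gawk only: traverse the array by value, numerically, descending
      PROCINFO["sorted_in"] = "@val_num_desc"
      for (k in arr) print k, arr[k]
    }')
  printf '%s\n' "$result"
else
  echo "gawk not found; this sketch needs GNU awk 4.0 or later" >&2
fi
```

The same "@val_num_desc" specifier is used as in the asort() call, but here the original keys stay attached to their values throughout.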

Upvotes: 3

Reuben L.

Reputation: 2859

Two possible solutions:

  1. Use gawk and the built-in asort() and asorti() functions

  2. Pipe the output of your awk command to sort -k2 -Vr. This will sort descending by the second column.

Note: the -V flag is nonstandard; it is available in GNU sort. Credit to Jonathan Leffler.

Upvotes: 0
