Reputation: 2446

Sort a column by number of identical occurrences - using awk, sort, tr or uniq?

Let's say I have some tab-separated data:

Peter   5
Joe     8
Peter   7
Peter   8
Joe     4
Laura   3

And I want to sort it by the number of times a name occurs in the first column (max to min) So we'd have Peter (3 occurrences) Joe (2 occurrences) and Laura (1 occurrence).

Peter   5
Peter   7
Peter   8
Joe     8
Joe     4
Laura   3

It only needs sorted by the first column, not the second. I've been reading sort's documentation, and I don't think it has the functionality. Anyone have an easy method?

Upvotes: 1

Answers (5)

flodel

Reputation: 89087

This works:

for person in $(awk '{print $1}' file.txt | sort | uniq -c | sort -dnr | awk '{print $2}');
do grep -e "^$person[[:space:]]" file.txt;
done

Upvotes: 1

lind

Reputation: 2269

Here's another one using awk:

awk '{ a[ $1, ++b[$1] ]=$0 ; if(b[$1]>max) max=b[$1] }

   END{ for(x=max;x>=1;x--)
         for( k in b )
           if( a[k,x] )
              for(y=1;y<=x;y++) {
                    print a[k,y]
                    delete a[k,y]
               }
   }' filename

It works fine with gawk and POSIX awk. The presence of three loops in the END statement might affect performance with big files.

Upvotes: 0

Jonathan Leffler

Reputation: 754280

That's a surprisingly hard sort criterion. This code works, but it is pretty ugly:

data=${1:-data}
awk '{ print $1 }' $data |
sort |
uniq -c |
sort -k2 |
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
sort -k1,1nr -k3,3 -k2n |
awk 'BEGIN{OFS="\t"} { print $3, $4 }'

It assumes bash 4.x for 'process substitution' but doesn't use any sorting built into awk (that's a GNU extension compared with POSIX awk). With an explicit temporary file, it can be made to work in shells without process substitution.

data=${1:-data}                  # File named on command line, or uses name 'data'
awk '{ print $1 }' $data |       # List of names
sort |                           # Sorted list of names
uniq -c |                        # Count occurrences of each name
sort -k2 |                       # Sort in name order
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
# The process substitution numbers each record in sequence and sorts in name order
# The join matches the names (column 2) and outputs the frequency, record number, name, value
sort -k1,1nr -k3,3 -k2n |        # Sort on frequency reversed, name, original line number
awk 'BEGIN{OFS="\t"} { print $3, $4 }'   # Print name and value

Using GNU awk with a built-in sort, or Perl or Python, is probably better than this.

For the original data, the output is:

Peter   5
Peter   7
Peter   8
Joe     8
Joe     4
Laura   3

Given this extended version of the data:

Peter   5
Joe     8
Peter   7
Peter   8
Joe     4
Laura   3
Peter   50
Joe     80
Peter   70
Peter   80
Joe     40
Laura   30
Peter   700
Peter   800
Peter   7002
Peter   8002
Peter   7000
Peter   8000
Peter   7001
Peter   8001
Pater   50
Jae     80
Pater   70
Pater   80
Jae     40
Laura   30

The output is:

Peter   5
Peter   7
Peter   8
Peter   50
Peter   70
Peter   80
Peter   700
Peter   800
Peter   7002
Peter   8002
Peter   7000
Peter   8000
Peter   7001
Peter   8001
Joe     8
Joe     4
Joe     80
Joe     40
Laura   3
Laura   30
Laura   30
Pater   50
Pater   70
Pater   80
Jae     80
Jae     40

The -k3,3 sort term is necessary for this data set; it sorts Laura's entries before Pater's entries (when omitted, you get those two lists interleaved).

Upvotes: 0

Steve

Reputation: 54452

Here's one way using GNU awk. Run like:

awk -f script.awk file

Contents of script.awk:

BEGIN {
    FS="\t"
}

{
    c[$1]++
    r[$1] = (r[$1] ? r[$1] ORS : "") $0
}

END {

    for (i in c) {
        a[c[i],i] = i
    }

    n = asorti(a)

    for (i=1;i<=n;i++) {
        split(a[i], b, SUBSEP)
        x[++j] = b[2]
    }

    for (i=n;i>=1;i--) {
        print r[x[i]]
    }
}

Results:

Peter   5
Peter   7
Peter   8
Joe     8
Joe     4
Laura   3

Upvotes: 0

Kent

Reputation: 195169

not sexy but works for your example:

 awk  'NR==FNR{a[$1]++;next}{ print a[$1],$0}' file file|sort -nr|sed -r 's/[0-9]* //'

test with your data:

kent$  cat n.txt
Peter   5
Joe     8
Peter   7
Peter   8
Joe     4
Laura   3

kent$  awk  'NR==FNR{a[$1]++;next}{ print a[$1],$0}' n.txt n.txt|sort -nr|sed -r 's/[0-9]* //'
Peter   8
Peter   7
Peter   5
Joe     8
Joe     4
Laura   3

Upvotes: 4

Sort a column by number of identical occurrences - using awk, sort, tr or uniq?

Answers (5)

Related Questions