Reputation: 2446
Let's say I have some tab-separated data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
And I want to sort it by the number of times a name occurs in the first column (max to min). So we'd have Peter (3 occurrences), Joe (2 occurrences), and Laura (1 occurrence):
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
It only needs to be sorted by the first column, not the second. I've been reading sort's documentation, and I don't think it has this functionality. Does anyone have an easy method?
Upvotes: 1
Views: 2491
Reputation: 89087
This works:
for person in $(awk '{print $1}' file.txt | sort | uniq -c | sort -nr | awk '{print $2}');
do grep -e "^$person[[:space:]]" file.txt;
done
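A quick way to try it end to end (the sample data written to a hypothetical `file.txt`; `sort -nr` orders the counts numerically, descending):

```shell
# Sample tab-separated data in a hypothetical file named file.txt
printf 'Peter\t5\nJoe\t8\nPeter\t7\nPeter\t8\nJoe\t4\nLaura\t3\n' > file.txt

# Count occurrences of each name, list names by descending count,
# then re-grep the file once per name to emit its lines in order
for person in $(awk '{print $1}' file.txt | sort | uniq -c | sort -nr | awk '{print $2}'); do
    grep -e "^$person[[:space:]]" file.txt
done
```

Note that this rescans the file once per distinct name, so it is quadratic-ish on files with many names.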
Upvotes: 1
Reputation: 2269
Here's another one using awk:
awk '{ a[ $1, ++b[$1] ]=$0 ; if(b[$1]>max) max=b[$1] }
END{ for(x=max;x>=1;x--) for( k in b ) if( a[k,x] ) for(y=1;y<=x;y++) { print a[k,y] ; delete a[k,y] } }' filename
It works fine with gawk and POSIX awk. The presence of three loops in the END statement might affect performance with big files.
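A self-contained way to try it (sample data in a hypothetical `input.txt`; note the `;` between `print` and `delete`, which awk requires as a statement separator):

```shell
# Sample tab-separated data in a hypothetical file named input.txt
printf 'Peter\t5\nJoe\t8\nPeter\t7\nPeter\t8\nJoe\t4\nLaura\t3\n' > input.txt

# a[name, i] holds the i-th record for each name; b[name] is its count.
# END prints each name's records when the countdown x reaches its count,
# deleting them so the name is skipped at lower values of x.
awk '{ a[ $1, ++b[$1] ]=$0 ; if(b[$1]>max) max=b[$1] }
END{ for(x=max;x>=1;x--) for( k in b ) if( a[k,x] ) for(y=1;y<=x;y++) { print a[k,y] ; delete a[k,y] } }' input.txt
```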
Upvotes: 0
Reputation: 754280
That's a surprisingly hard sort criterion. This code works, but it is pretty ugly:
data=${1:-data}
awk '{ print $1 }' $data |
sort |
uniq -c |
sort -k2 |
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
sort -k1,1nr -k3,3 -k2n |
awk 'BEGIN{OFS="\t"} { print $3, $4 }'
It assumes bash for 'process substitution', but doesn't use any sorting built into awk (that's a GNU extension to POSIX awk). With an explicit temporary file, it can be made to work in shells without process substitution.
data=${1:-data} # File named on command line, or uses name 'data'
awk '{ print $1 }' $data | # List of names
sort | # Sorted list of names
uniq -c | # Count occurrences of each name
sort -k2 | # Sort in name order
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - <(awk '{ print NR, $0 }' $data | sort -k2) |
# The process substitution numbers each record in sequence and sorts in name order
# The join matches the names (column 2) and outputs the frequency, record number, name, value
sort -k1,1nr -k3,3 -k2n | # Sort on frequency reversed, name, original line number
awk 'BEGIN{OFS="\t"} { print $3, $4 }' # Print name and value
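The temporary-file variant mentioned above might look like this sketch (the `mktemp` and `trap` details are my assumption, not part of the original):

```shell
# Sample data, so the sketch is self-contained
printf 'Peter\t5\nJoe\t8\nPeter\t7\nPeter\t8\nJoe\t4\nLaura\t3\n' > data

tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT

# Materialize the process substitution: number each record, sort by name
awk '{ print NR, $0 }' data | sort -k2 > "$tmp"

awk '{ print $1 }' data |
sort |
uniq -c |
sort -k2 |
join -1 2 -2 2 -o 1.1,2.1,2.2,2.3 - "$tmp" |
sort -k1,1nr -k3,3 -k2n |
awk 'BEGIN{OFS="\t"} { print $3, $4 }'
```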
Using GNU awk with a built-in sort, or Perl or Python, is probably better than this.
For the original data, the output is:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
Given this extended version of the data:
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
Peter 50
Joe 80
Peter 70
Peter 80
Joe 40
Laura 30
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Pater 50
Jae 80
Pater 70
Pater 80
Jae 40
Laura 30
The output is:
Peter 5
Peter 7
Peter 8
Peter 50
Peter 70
Peter 80
Peter 700
Peter 800
Peter 7002
Peter 8002
Peter 7000
Peter 8000
Peter 7001
Peter 8001
Joe 8
Joe 4
Joe 80
Joe 40
Laura 3
Laura 30
Laura 30
Pater 50
Pater 70
Pater 80
Jae 80
Jae 40
The -k3,3 sort term is necessary for this data set; it sorts Laura's entries before Pater's entries (when omitted, you get those two lists interleaved).
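The GNU awk alternative alluded to above might be sketched like this (assumes gawk for `PROCINFO["sorted_in"]`; note that tie-breaking between equally frequent names is left to gawk here, unlike the name-sorted pipeline):

```shell
# Sample data, tab-separated, in the default file name 'data'
printf 'Peter\t5\nJoe\t8\nPeter\t7\nPeter\t8\nJoe\t4\nLaura\t3\n' > data

# Skip gracefully where gawk is not installed
command -v gawk >/dev/null 2>&1 || exit 0

# Pass 1 counts each name; pass 2 collects each name's records in order;
# END walks the counts in descending numeric order (a gawk-only feature).
gawk '
NR == FNR { count[$1]++; next }
{ lines[$1] = (lines[$1] ? lines[$1] ORS : "") $0 }
END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (name in count) print lines[name]
}' data data
```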
Upvotes: 0
Reputation: 54452
Here's one way using GNU awk. Run like:
awk -f script.awk file
Contents of script.awk:
BEGIN {
FS="\t"
}
{
c[$1]++
r[$1] = (r[$1] ? r[$1] ORS : "") $0
}
END {
for (i in c) {
a[c[i],i] = i
}
n = asorti(a)
for (i=1;i<=n;i++) {
split(a[i], b, SUBSEP)
x[++j] = b[2]
}
for (i=n;i>=1;i--) {
print r[x[i]]
}
}
Results:
Peter 5
Peter 7
Peter 8
Joe 8
Joe 4
Laura 3
Upvotes: 0
Reputation: 195169
Not sexy, but it works for your example:
awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' file file|sort -nr|sed -r 's/[0-9]* //'
test with your data:
kent$ cat n.txt
Peter 5
Joe 8
Peter 7
Peter 8
Joe 4
Laura 3
kent$ awk 'NR==FNR{a[$1]++;next}{ print a[$1],$0}' n.txt n.txt|sort -nr|sed -r 's/[0-9]* //'
Peter 8
Peter 7
Peter 5
Joe 8
Joe 4
Laura 3
Upvotes: 4