Reputation: 21
Hi, so basically I have a 'temp' text file containing a long list of email addresses (some repeated). What I'm trying to output is the email addresses ordered from highest frequency to lowest, followed by the total number of unique email addresses at the end.
awk '{printf "%s %s\n", $2, $1} END {print "total "NR}' temp | sort -n | uniq -c -i
So far I got the output I wanted, except that it's not ordered by highest frequency; instead, it's in alphabetical order.
I've been stuck on this for a few hours now and have no idea why. I know I probably did something wrong, but I'm not sure what. Please let me know if you need more information, or if the code I provided is not actually the problem. Thank you in advance.
edit: I've also tried sort -nk1 (the output has the frequency in the first column) and even -nk2
edit2: Here is a sample of my 'temp' file
aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
charter.net
yahoo.com
edit 3:
expected output:
33 aol.com
24 netscape.net
18 yahoo.com
5 adelphia.net
4 twcny.rr.com
3 charter.net
total 6
(no repeat emails, 6 total unique email addresses)
Upvotes: 0
Views: 128
Reputation: 37404
In GNU awk, using @Sundeep's data:
$ cat program.awk
{ a[$0]++ }                                  # count occurrences of each domain
END {
    PROCINFO["sorted_in"] = "@val_num_desc"  # make for(i in a) iterate in descending value order
    for (i in a) {                           # iterates in descending count order
        print a[i], i
        j++                                  # count unique domains
    }
    print "total", j
}
Run it:
$ awk -f program.awk ip.txt
3 yahoo.com
2 netscape.net
1 twcny.rr.com
1 aol.com
1 adelphia.net
1 charter.net
total 6
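The same program also fits on one line if you'd rather not keep a separate file (a sketch of the same approach; PROCINFO["sorted_in"] requires GNU awk 4.0 or later, and entries with equal counts may appear in any order):
$ awk '{a[$0]++} END{PROCINFO["sorted_in"]="@val_num_desc"; for(i in a){print a[i], i; j++} print "total", j}' ip.txt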
Upvotes: 2
Reputation: 6180
Summarising a few tested approaches to this handy sorting task:
Using bash
(In my case v4.3.46)
sortedfile="$(sort temp)" ; countedfile="$(uniq -c <<< "$sortedfile")" ; uniquefile="$(sort -rn <<< "$countedfile")" ; totalunique="$(wc -l <<< "$uniquefile")" ; echo -e "$uniquefile\nTotal: $totalunique"
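The same one-liner unrolled, purely for readability (identical commands, just split into separate statements):
sortedfile="$(sort temp)"                    # group identical addresses together
countedfile="$(uniq -c <<< "$sortedfile")"   # prefix each unique address with its count
uniquefile="$(sort -rn <<< "$countedfile")"  # order by count, highest first
totalunique="$(wc -l <<< "$uniquefile")"     # one line per unique address
echo -e "$uniquefile\nTotal: $totalunique"
Keeping each step in a variable means the grouped output can be both printed and counted without re-running sort.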
Using sh/ash/busybox
(Though they aren't all the same, they all worked the same for these tests)
time (sort temp > /tmp/sortedfile ; uniq -c /tmp/sortedfile > /tmp/countedfile ; sort -rn /tmp/countedfile > /tmp/uniquefile ; totalunique="$(cat /tmp/uniquefile | wc -l)" ; cat /tmp/uniquefile ; echo "Total: $totalunique")
Using perl
(see this answer https://stackoverflow.com/a/40145395/3544399)
perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "Total: ", $c}' temp
A file temp was created using a random generator:
55304 total addresses
17012 duplicate addresses
A small sample of the file looks like this:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
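The generator itself isn't shown; a hypothetical bash sketch that would produce a file of roughly this shape (the local parts, pool size, and domains here are made up for illustration, not the script actually used):
# hypothetical test-data generator, not the original script
domains=(aol.com netscape.net yahoo.com adelphia.net twcny.rr.com charter.net)
for ((i = 0; i < 55304; i++)); do
    # drawing from a small pool guarantees plenty of duplicates
    echo "user$((RANDOM % 20000))@${domains[RANDOM % ${#domains[@]}]}"
done > temp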
For the sake of completeness, it's worth mentioning the performance:
          perl:       sh:         bash:
Total:    17012       17012       17012
real      0m0.119s    0m0.838s    0m0.973s
user      0m0.061s    0m0.772s    0m0.894s
sys       0m0.027s    0m0.025s    0m0.056s
Original Answer (counted total addresses rather than unique addresses):
tcount="$(cat temp | wc -l)" ; sort temp | uniq -c -i | sort -rn ; echo "Total: $tcount"
tcount="$(cat temp | wc -l)" : make a variable with the line count
sort temp : group email addresses ready for uniq
uniq -c -i : count occurrences, allowing for case variation
sort -rn : sort according to numerical occurrences and reverse the order (highest on top)
echo "Total: $tcount" : show the total addresses at the bottom
Sample temp file:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Sample Output:
5 [email protected]
3 [email protected]
3 [email protected]
1 [email protected]
1 [email protected]
Total: 13
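Since the original answer counts total addresses (13 here) rather than unique ones, a minimal adjustment is to count the grouped output instead of the raw file (a sketch; uniq -i keeps the case-insensitive matching):
sort temp | uniq -c -i | sort -rn ; echo "Total: $(sort temp | uniq -i | wc -l)"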
Edit: See comments below regarding use of sort
Upvotes: 1
Reputation: 23667
Sample input, modified to include an email with two instances:
$ cat ip.txt
aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
netscape.net
charter.net
yahoo.com
Using perl
$ perl -lne '
    $c++ if !$h{$_}++;
    END
    {
        @k = sort { $h{$b} <=> $h{$a} } keys %h;
        print "$h{$_} $_" foreach (@k);
        print "total ", $c;
    }' ip.txt
3 yahoo.com
2 netscape.net
1 adelphia.net
1 charter.net
1 aol.com
1 twcny.rr.com
total 6
$c++ if !$h{$_}++ : increment the counter for unique input lines; increment the hash value with the input line as key. The default initial value is 0 for both.
@k = sort { $h{$b} <=> $h{$a} } keys %h : get the keys sorted by descending numeric values of the hash
print "$h{$_} $_" foreach (@k) : print each hash value and key, based on the sorted keys @k
print "total ", $c : print the total number of unique lines
Can be written in single line if preferred:
perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "total ", $c}' ip.txt
Reference: How to sort perl hash on values and order the keys correspondingly
Upvotes: 2