Daniel

Reputation: 21

What did I do wrong? Not sorting properly with awk

Hi, so basically I have a 'temp' text file containing a long list of email addresses (some repeated). What I'm trying to output is the email addresses ordered by highest frequency, with the total number of unique email addresses at the end.

awk '{printf "%s %s\n", $2, $1} END {print "total "NR}' temp | sort -n | uniq -c -i

So far I get the output I want, except that it's not ordered by highest frequency; instead, it's in alphabetical order.

I've been stuck on this for a few hours now and have no idea why. I know I probably did something wrong, but I'm not sure what. Please let me know if you need more information, or if the problem isn't in the code I provided. Thank you in advance.

edit: I've also tried sort -nk1 (the output has the frequency in the first column) and even -nk2

edit2: Here is a sample of my 'temp' file

aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
charter.net
yahoo.com

edit 3:

expected output:

33 aol.com
24 netscape.net
18 yahoo.com
5 adelphia.net
4 twcny.rr.com
3 charter.net
total 6

(no repeat emails, 6 total unique email addresses)

Upvotes: 0

Views: 128

Answers (3)

James Brown

Reputation: 37404

In GNU awk, using @Sundeep's data:

$ cat program.awk
{ a[$0]++ }                                # count occurrences of each domain
END {
    PROCINFO["sorted_in"]="@val_num_desc"  # make for(i in a) iterate in descending numeric value order
    for(i in a) {                          # highest counts come first
        print a[i], i
        j++                                # count the unique domains
    }
    print "total", j
}

Run it:

$ awk -f program.awk ip.txt
3 yahoo.com
2 netscape.net
1 twcny.rr.com
1 aol.com
1 adelphia.net
1 charter.net
total 6
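
The same logic can be condensed into a one-liner, equivalent to the program.awk above (note that PROCINFO["sorted_in"] is a gawk extension, so GNU awk is required):

$ awk '{a[$0]++} END{PROCINFO["sorted_in"]="@val_num_desc"; for(i in a){print a[i], i; j++}; print "total", j}' ip.txt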

Upvotes: 2

hmedia1

Reputation: 6180

Updated / Summary

Summarising a few tested approaches to this handy sorting task:

Using bash (v4.3.46 in my case)

sortedfile="$(sort temp)" ; countedfile="$(uniq -c <<< "$sortedfile")" ; uniquefile="$(sort -rn <<< "$countedfile")" ; totalunique="$(wc -l <<< "$uniquefile")" ; echo -e "$uniquefile\nTotal: $totalunique"
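
The same commands, split one per line for readability:

sortedfile="$(sort temp)"
countedfile="$(uniq -c <<< "$sortedfile")"
uniquefile="$(sort -rn <<< "$countedfile")"
totalunique="$(wc -l <<< "$uniquefile")"
echo -e "$uniquefile\nTotal: $totalunique"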

Using sh/ash/busybox (they aren't all the same shell, but they all behaved the same for these tests)

time (sort temp > /tmp/sortedfile ; uniq -c /tmp/sortedfile > /tmp/countedfile ; sort -rn /tmp/countedfile > /tmp/uniquefile ; totalunique="$(cat /tmp/uniquefile | wc -l)" ; cat /tmp/uniquefile ; echo "Total: $totalunique")

Using perl (see this answer https://stackoverflow.com/a/40145395/3544399)

perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "Total: ", $c}' temp

What was tested

A file temp was created using a random generator:

  • The @domain.com part was different in the unique addresses
  • Duplicated addresses were scattered throughout the file
  • The file had 55304 total addresses
  • The file had 17012 unique addresses (the rest were duplicates)

A small sample of the file looks like this:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Performance:

For the sake of completeness, here are the timings for each approach:

perl:               sh:                 bash:

Total: 17012        Total:    17012     Total:    17012

real    0m0.119s    real    0m0.838s    real    0m0.973s
user    0m0.061s    user    0m0.772s    user    0m0.894s
sys     0m0.027s    sys     0m0.025s    sys     0m0.056s

Original Answer (counted total addresses rather than unique addresses; a corrected sketch follows the sample output below):

tcount="$(cat temp | wc -l)" ; sort temp | uniq -c -i | sort -rn ; echo "Total: $tcount"
  • tcount="$(cat temp | wc -l)": store the file's total line count in a variable
  • sort temp: group identical email addresses together, ready for uniq
  • uniq -c -i: count occurrences, ignoring case variation
  • sort -rn: sort numerically by occurrence count, in reverse order (highest on top)
  • echo "Total: $tcount": print the total address count at the bottom

Sample temp file:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Sample Output:

   5 [email protected]
   3 [email protected]
   3 [email protected]
   1 [email protected]
   1 [email protected]
Total:       13
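
To report the number of unique addresses rather than the total line count, one minimal variant (a sketch, not the original tested command) counts the de-duplicated output instead of the raw file; sort -f folds case so that case variants land adjacent for uniq -i:

sort -f temp | uniq -c -i | sort -rn ; echo "Total: $(sort -f temp | uniq -i | wc -l)"

For the sample above this would print Total: 5, since five distinct addresses remain after case-insensitive de-duplication.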

Edit: note that for uniq -i to merge case variants reliably, the preceding sort should use -f (fold case) so that such variants end up on adjacent lines.

Upvotes: 1

Sundeep

Reputation: 23667

Sample input, modified to include an email with two instances:

$ cat ip.txt 
aol.com
netscape.net
yahoo.com
yahoo.com
adelphia.net
twcny.rr.com
netscape.net
charter.net
yahoo.com

Using perl

$ perl -lne '
$c++ if !$h{$_}++;
END
{
    @k = sort { $h{$b} <=> $h{$a} } keys %h;
    print "$h{$_} $_" foreach (@k);
    print "total ", $c;
}' ip.txt
3 yahoo.com
2 netscape.net
1 adelphia.net
1 charter.net
1 aol.com
1 twcny.rr.com
total 6
  • $c++ if !$h{$_}++: increment the counter only on the first occurrence of a line, while incrementing the hash value keyed by that line. Both default to an initial value of 0
  • After processing all input lines:
    • @k = sort { $h{$b} <=> $h{$a} } keys %h: get the keys sorted by descending numeric hash value
    • print "$h{$_} $_" foreach (@k): print each hash value and key, following the sorted key order in @k
    • print "total ", $c: print the number of unique lines


This can be written as a single line if preferred:

perl -lne '$c++ if !$h{$_}++; END{@k = sort { $h{$b} <=> $h{$a} } keys %h; print "$h{$_} $_" foreach (@k); print "total ", $c}' ip.txt


Reference: How to sort perl hash on values and order the keys correspondingly

Upvotes: 2
