100 Most Used Strings in File

Question

How can I find the top 100 most used strings (words) in a .txt file using Perl? So far I have the following:

use 5.012;
use warnings;

open(my $file, "<", "file.txt") or die "Cannot open file.txt: $!";

my %word_count;
while (my $line = <$file>) {
  foreach my $word (split ' ', $line) {
     $word_count{$word}++;
  } 
} 

for my $word (sort keys %word_count) {
  print "'$word': $word_count{$word}\n";
}

But this only counts each word and prints the counts in alphabetical order. I want the 100 most frequently used words in the file, sorted by number of occurrences. Any ideas?

Related: Count number of times string repeated in files perl

tchrist · Accepted Answer

From reading the fine perlfaq4(1) manpage, one learns how to sort hashes by value. So try this. It’s rather more idiomatically “perlian” than your approach.

#!/usr/bin/env perl    
use v5.12;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:utf8 :std);

my %seen;
while (<>) {
    $seen{$_}++ for split /\W+/;  # or just split;
}

my $count = 0;
for (sort {
        $seen{$b} <=> $seen{$a}
                  ||
           lc($a) cmp lc($b)    # XXX: should be v5.16's fc() instead
                  ||
              $a  cmp  $b
     } keys %seen)
{
    next unless /\w/;
    printf "%-20s %5d\n", $_, $seen{$_};
    last if ++$count >= 100;  # stop after 100 words
}

When run against itself, the first 10 lines of output are:

seen                     6
use                      5
_                        3
a                        3
b                        3
cmp                      2
count                    2
for                      2
lc                       2
my                       2
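
If you would rather have the top entries back as a list than print them inside the loop, the same sort can be wrapped in a small function. This is only a sketch: `top_words` and the `@sample` lines are names made up for illustration, and it assumes v5.16 or later so that `fc()` is available for the case-insensitive tiebreak. It also leaves out the UTF-8 handling from the script above.

```perl
#!/usr/bin/env perl
use v5.16;       # turns on strict and makes fc() available
use warnings;

# Hypothetical helper: returns up to $n [word, count] pairs,
# most frequent first.
sub top_words {
    my ($n, @lines) = @_;
    my %seen;
    for my $line (@lines) {
        # grep drops the empty leading field that split produces
        # when a line starts with a non-word character
        $seen{$_}++ for grep { length } split /\W+/, $line;
    }
    my @sorted = sort {
        $seen{$b} <=> $seen{$a}     # highest count first
            || fc($a) cmp fc($b)    # then case-insensitively
            || $a cmp $b            # then by exact spelling
    } keys %seen;
    $#sorted = $n - 1 if @sorted > $n;   # keep only the first $n words
    return map { [ $_, $seen{$_} ] } @sorted;
}

# Made-up sample input; in a real script pass <> or a filehandle's lines.
my @sample = ("the cat and the hat", "The cat sat", "the end");
printf "%-20s %5d\n", @$_ for top_words(2, @sample);
```

With the sample lines this prints `the` with a count of 3 and `cat` with a count of 2. Truncating the sorted list means there is no separate counter to keep in step with the loop. The whole hash is still sorted, though, so this does not save any work on large inputs.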
