Reputation: 927
How can I find the top 100 most used strings (words) in a .txt
file using Perl? So far I have the following:
use 5.012;
use warnings;
open(my $file, "<", "file.txt");
my %word_count;
while (my $line = <$file>) {
foreach my $word (split ' ', $line) {
$word_count{$word}++;
}
}
for my $word (sort keys %word_count) {
print "'$word': $word_count{$word}\n";
}
But this only counts each word, and organizes it in alphabetical order. I want the top 100 most frequently used words in the file, sorted by number of occurrences. Any ideas?
Related: Count number of times string repeated in files perl
Upvotes: 1
Views: 683
Reputation: 80423
From reading the fine perlfaq4(1) manpage, one learns how to sort hashes by value. So try this. It’s rather more idiomatically “perlian” than your approach.
#!/usr/bin/env perl
use v5.12;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:utf8 :std);
my %seen;
while (<>) {
$seen{$_}++ for split /\W+/; # or just split;
}
my $count = 0;
for (sort {
$seen{$b} <=> $seen{$a}
||
lc($a) cmp lc($b) # XXX: should be v5.16's fc() instead
||
$a cmp $b
} keys %seen)
{
next unless /\w/;
printf "%-20s %5d\n", $_, $seen{$_};
last if ++$count > 100;
}
When run against itself, the first 10 lines of output are:
seen 6
use 5
_ 3
a 3
b 3
cmp 2
count 2
for 2
lc 2
my 2
Upvotes: 8