Reputation: 51

Perl: Frequency of words and a top ten list of the words

I'm working on getting a Perl script to work, and be warned: I'm rather new to this.

Here's what I'm trying to achieve: a script that takes a .txt file and counts each word in the file. Once the words are counted, it should make a list of the top 10 words in the file, showing how many times each word was counted.

Here's what I've got so far: I've been able to make the script count the words and how many times they appear. Now I need to make the top ten list, and I don't really know where or how to do it. This is a homework assignment, so I don't want or expect you to solve it for me; just give me some pointers on where to begin.

Thank you for helping (in advance)

Updated 15 Oct

OK, it's sorting everything great, but...

As it is now it's just printing everything in one line. I need it to print it like this:

4 word
3 next word
2 next word

Well, you get it.

I think I've got it... I think :P

......................................

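#!/usr/bin/perl

use strict;
use warnings;
use utf8;

# Prompt ("Which file?") only when no file name was given on the command line.
print "Vilken fil?\n" unless @ARGV;
my $filen = @ARGV ? shift(@ARGV) : <STDIN>;
chomp $filen;

my %freq;

# Three-argument open; the message means "Could not find that file!"
open my $DATA, '<', $filen or die "Hittade inte den filen! ($filen: $!)\n";

while (<$DATA>) {
    s/[;:()".,!?]/ /g;    # replace punctuation with spaces
    foreach my $word (split ' ', lc $_) {
        $freq{$word}++;
    }
}
close $DATA;

# Sort words by descending count and keep at most the first ten.
my @listing = sort { $freq{$b} <=> $freq{$a} } keys %freq;
@listing = @listing[0 .. 9] if @listing > 10;
foreach my $word (@listing) {
    print "$freq{$word} $word\n";
}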

Upvotes: 3

Answers (3)

Tudor Constantin

Reputation: 26861

Building on Nate's answer, you can extract the top 10 elements by using a list slice:

@eldest = ( sort { $age{$b} <=> $age{$a} } keys %age)[0..9];
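One caveat with a fixed `[0..9]` slice: when the hash has fewer than ten keys, recent Perl versions pad the result with undef, which then triggers warnings when printed. A minimal sketch of one way to guard against that, assuming a hypothetical helper `top_n` and a made-up `%freq` hash (neither comes from the original post):

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $n keys of %$counts, highest count first.
# Ties are broken alphabetically so the result is deterministic.
sub top_n {
    my ($counts, $n) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
    # Clamp the last index so the slice never reaches past the end of the list.
    my $last = $#sorted < $n - 1 ? $#sorted : $n - 1;
    return @sorted[0 .. $last];
}

# Made-up sample data with only three words.
my %freq = (the => 4, cat => 3, sat => 2);
print "$freq{$_} $_\n" for top_n(\%freq, 10);
```

With this sample data only three lines are printed, one per word, instead of three words followed by seven empty entries.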

Upvotes: 2

Nate C-K

Reputation: 5932

Look at the docs for the Perl sort function:

http://perldoc.perl.org/functions/sort.html

It has a form that lets you specify a block of code to define the ordering of elements. You can use this to order your list by frequency rather than by the word's alphabetical ordering.

The docs include this example:

# this sorts the %age hash by value instead of key
# using an in-line function
@eldest = sort { $age{$b} <=> $age{$a} } keys %age;

You should be able to adapt this pattern to your own problem.

Probably the most efficient way to get the top ten list is to keep track of the top ten as you go: each time you compute a count, check if it belongs in the top ten, and if so then insert it in the correct place, potentially knocking off the bottom item on the list. That way, you only need to track the ordering of ten words at a time regardless of how big the dictionary is. I don't know if you need this extra efficiency, though.
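A rough sketch of that idea, with one simplification that is my assumption rather than part of the description above: the counts are finished first, and the top list is then built in a single pass over the hash, so a word's count never changes after it has been placed. The helper name `top_n_running` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the counts once, keeping only the best $n seen so far.
sub top_n_running {
    my ($counts, $n) = @_;
    my @top;    # kept sorted, highest count first, never longer than $n
    return @top if $n < 1;
    for my $word (keys %$counts) {
        my $count = $counts->{$word};
        # Skip words that cannot beat the lowest entry of an already full list.
        next if @top == $n && $count <= $counts->{ $top[-1] };
        # Find the insertion point: the first entry with a smaller count.
        my $i = 0;
        $i++ while $i < @top && $counts->{ $top[$i] } >= $count;
        splice @top, $i, 0, $word;
        pop @top if @top > $n;    # knock the bottom item off the list
    }
    return @top;
}

# Made-up sample data with distinct counts.
my %freq = (the => 4, cat => 3, sat => 2, on => 1);
print "$freq{$_} $_\n" for top_n_running(\%freq, 2);
```

Each word is compared against at most `$n` entries, so the work grows linearly with the number of distinct words rather than as a full sort would. Words with equal counts come out in whatever order the hash yields them.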

By the way, I have seen this kind of question in several job interviews, so it's a good thing to have a handle on.

Upvotes: 3

Grant Birchmeier

Reputation: 18494

Ha, by the time I finished reading your problem description I knew it was some kind of homework assignment! :)

For the next step, you have to scan through your %count hash and determine which words have the most occurrences.

The most naive way would be to scan through the list 10 times; each time, find the word with the highest count, store it in a top-ten list, and then remove it from %count (setting its count to 0 would also work).
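As a sketch of this repeated-scan approach (the helper name `top_n_by_scanning` and the sample data are hypothetical, and the helper works on a copy so the original hash is left intact):

```perl
use strict;
use warnings;

# Hypothetical helper: pick the highest-count word $n times,
# removing it from the working copy after each pass.
sub top_n_by_scanning {
    my ($counts, $n) = @_;
    my %remaining = %$counts;    # copy, so the caller's hash is untouched
    my @top;
    for (1 .. $n) {
        last unless %remaining;  # fewer than $n distinct words
        my $best;
        for my $word (keys %remaining) {
            $best = $word
                if !defined $best || $remaining{$word} > $remaining{$best};
        }
        push @top, $best;
        delete $remaining{$best};
    }
    return @top;
}

# Made-up sample data.
my %count = (the => 4, cat => 3, sat => 2);
print "$count{$_} $_\n" for top_n_by_scanning(\%count, 10);
```

This does up to ten full passes over the hash, which is fine for a homework-sized file but slower than a single sort or a single pass on large inputs.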

If you want to get more ambitious, you could implement a sort function that sorts the %count entries, and then the 10 highest will be right together.

My Perl is rusty, but the Perl lib might even have some sort functions for you. In general, it's definitely worth your time to skim through a library reference to familiarize yourself with what's available.

Upvotes: -1
