ultraman
ultraman

Reputation:

How would you develop a frequency-sorted list of the ten thousand most-used words in the English language?

I was once asked by a current employee how I would develop a frequency-sorted list of the ten thousand most-used words in the English language. Suggest a solution in the language of your choosing, though I prefer C#.

Please provide not only an implementation, but also an explanation.

Thanks

Upvotes: 1

Views: 1857

Answers (4)

Darius Bacon
Darius Bacon

Reputation: 15124

As a first cut, absent further definition of the problem (just what do you mean by the most-used words in English?) -- I'd buy Google's n-gram data, intersect the 1-grams with an English dictionary, and pipe that to sort -rn -k 2 | head -10000.

Upvotes: 2

Mehrdad Afshari
Mehrdad Afshari

Reputation: 421978

IEnumerable<string> inputList; // input words.
var mostFrequentlyUsed = inputList.GroupBy(word => word)
  .Select(wordGroup => new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
  .OrderByDescending(word => word.Frequency);

Explanation: I don't really know if it requires further explanation but I'll try. inputList is an array or any other collection providing source words. GroupBy function will group the input collection by some similar property (which is, in my code the object itself, as noted by the lambda word => word). The output (which is a set of groups by a specified key, the word) will be transformed to an object with Word and Frequency properties and sorted by Frequency property in descending order. You could use .Take(10000) to take the first 10000. The whole thing can be easily parallelized by .AsParallel() provided by PLINQ. The query operator syntax might look clearer:

var mostFrequentlyUsed = 
     (from word in inputList
      group word by word into wordGroup
      select new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
     .OrderByDescending(word => word.Frequency).Take(10000);

Upvotes: 3

cyberconte
cyberconte

Reputation: 2398

First thing to pop into my head (not syntax checked, and verbose (for perl) for demonstrative purposes)

#!/usr/bin/perl

my %wordFreq
foreach ( my $word in @words)
{
   $wordFreq{$word}++;
}

my @mostPopularWords = sort{$wordFreq{$a} <=> $wordFreq{$b} } keys %wordFreq;
for (my $i=0; $i < 10000; ++$i)
{
   print "$i: $mostPopularWords[$i] ($wordFreq{$mostPopularWords[$i]} hits)\n"
}

Upvotes: 1

Matthew Flaschen
Matthew Flaschen

Reputation: 284786

I would use map-reduce. This is a canonical example of a task well-suited for it. You can use Hadoop with C# with the streaming protocol. There are also other approaches. See Is there a .NET equivalent to Apache Hadoop? and https://stackoverflow.com/questions/436686/-net-mapreduce-implementation.

Upvotes: 1

Related Questions