Reputation:
I was once asked by a current employee how I would develop a frequency-sorted list of the ten thousand most-used words in the English language. Suggest a solution in the language of your choosing, though I prefer C#.
Please provide not only an implementation, but also an explanation.
Thanks
Upvotes: 1
Views: 1857
Reputation: 15124
As a first cut, absent further definition of the problem (just what do you mean by the most-used words in English?) -- I'd buy Google's n-gram data, intersect the 1-grams with an English dictionary, and pipe that to sort -rn -k 2 | head -10000
.
Upvotes: 2
Reputation: 421978
IEnumerable<string> inputList; // input words.
var mostFrequentlyUsed = inputList.GroupBy(word => word)
.Select(wordGroup => new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
.OrderByDescending(word => word.Frequency);
Explanation: I don't really know if it requires further explanation but I'll try. inputList
is an array or any other collection providing source words. GroupBy
function will group the input collection by some similar property (which is, in my code the object itself, as noted by the lambda word => word
). The output (which is a set of groups by a specified key, the word) will be transformed to an object with Word
and Frequency
properties and sorted by Frequency
property in descending order. You could use .Take(10000)
to take the first 10000. The whole thing can be easily parallelized by .AsParallel()
provided by PLINQ. The query operator syntax might look clearer:
var mostFrequentlyUsed =
(from word in inputList
group word by word into wordGroup
select new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
.OrderByDescending(word => word.Frequency).Take(10000);
Upvotes: 3
Reputation: 2398
First thing to pop into my head (not syntax checked, and verbose (for perl) for demonstrative purposes)
#!/usr/bin/perl
my %wordFreq
foreach ( my $word in @words)
{
$wordFreq{$word}++;
}
my @mostPopularWords = sort{$wordFreq{$a} <=> $wordFreq{$b} } keys %wordFreq;
for (my $i=0; $i < 10000; ++$i)
{
print "$i: $mostPopularWords[$i] ($wordFreq{$mostPopularWords[$i]} hits)\n"
}
Upvotes: 1
Reputation: 284786
I would use map-reduce. This is a canonical example of a task well-suited for it. You can use Hadoop with C# with the streaming protocol. There are also other approaches. See Is there a .NET equivalent to Apache Hadoop? and https://stackoverflow.com/questions/436686/-net-mapreduce-implementation.
Upvotes: 1