Reputation: 2057
Can anyone help me? I have the following code:
private List<string> GenerateTerms(string[] docs)
{
    List<string> uniques = new List<string>();
    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');
        List<string> toktolist = new List<string>(tokens.ToList());
        var query = toktolist.GroupBy(word => word)
                             .OrderByDescending(g => g.Count())
                             .Select(g => g.Key)
                             .Take(20000);
        foreach (string k in query)
        {
            if (!uniques.Contains(k))
                uniques.Add(k);
        }
    }
    return uniques;
}
It generates terms from a number of documents, keeping the highest-frequency ones. I did the same procedure using a Dictionary, and in both cases it took 440 milliseconds. But the surprise came when I ran the procedure with an ArrayList, as in the following code:
private ArrayList GenerateTerms(string[] docs)
{
    Dictionary<string, int> yy = new Dictionary<string, int>();
    ArrayList uniques = new ArrayList();
    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');
        yy.Clear();
        for (int j = 0; j < tokens.Length; j++)
        {
            if (!yy.ContainsKey(tokens[j].ToString()))
                yy.Add(tokens[j].ToString(), 1);
            else
                yy[tokens[j].ToString()]++;
        }
        var sortedDict = (from entry in yy
                          orderby entry.Value descending
                          select entry).Take(20000)
                         .ToDictionary(pair => pair.Key, pair => pair.Value);
        foreach (string k in sortedDict.Keys)
        {
            if (!uniques.Contains(k))
                uniques.Add(k);
        }
    }
    return uniques;
}
It took only 350 milliseconds. Shouldn't ArrayList be slower than List and Dictionary? Please help me make sense of this.
Upvotes: 0
Views: 1762
Reputation: 22230
I like Mark's solution. However, I thought that if you leverage a Dictionary properly you could squeeze out some more performance. Sure enough, this is a lot faster...
private static List<string> GenerateTerms(string[] docs)
{
    var termsDictionary = new Dictionary<string, int>();
    foreach (var doc in docs)
    {
        var terms = doc.Split(' ');
        int uniqueTermsCount = 0;
        foreach (string term in terms)
        {
            if (termsDictionary.ContainsKey(term))
                termsDictionary[term]++;
            else
            {
                uniqueTermsCount++;
                termsDictionary[term] = 1;
            }
        }
        if (uniqueTermsCount >= 20000)
            break;
    }
    return (from entry in termsDictionary
            orderby entry.Value descending
            select entry.Key).ToList();
}
To explain briefly, termsDictionary holds a Dictionary of terms and the number of times each term is repeated. The LINQ query at the end then returns the terms in descending order by the number of occurrences.
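As a small aside (a sketch of my own, not part of the benchmarked code above): the counting loop does two hash lookups per term, one in ContainsKey and one through the indexer. TryGetValue collapses those into a single lookup:

```csharp
using System;
using System.Collections.Generic;

class TermCounter
{
    // Counts term occurrences using one TryGetValue lookup per term
    // instead of ContainsKey followed by the indexer (two lookups).
    public static Dictionary<string, int> Count(IEnumerable<string> terms)
    {
        var counts = new Dictionary<string, int>();
        foreach (var term in terms)
        {
            int n;
            counts.TryGetValue(term, out n); // n stays 0 when the key is absent
            counts[term] = n + 1;
        }
        return counts;
    }

    static void Main()
    {
        var counts = Count("a man a plan a canal panama".Split(' '));
        Console.WriteLine(counts["a"]); // 3
    }
}
```

Whether the single-lookup version measurably helps depends on the input, but it is the idiomatic way to express count-or-insert.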
UPDATE
I added code to limit the unique number of terms to 20,000 per doc.
Here are the benchmarking results...
Below is the code I used to generate the docs array and run the tests...
static void Main(string[] args)
{
    string[] docs = new string[50000];
    for (int i = 0; i < docs.Length; i++)
    {
        docs[i] = "a man a plan a canal panama";
    }

    // warm up (don't time this)
    GenerateTermsOriginal(docs);

    Stopwatch sw = new Stopwatch();

    sw.Restart();
    var t1 = GenerateTermsOriginal(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    var t2 = GenerateTermsLinq(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    var t3 = GenerateTermsDictionary(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");
}
Upvotes: 1
Reputation: 838376
Your code does a lot of unnecessary work and uses inefficient data structures.
Try this instead:
private List<string> GenerateTerms(string[] docs)
{
    var result = docs
        .SelectMany(doc => doc.Split(' ')
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
            .Take(20000))
        .Distinct()
        .ToList();
    return result;
}
Refactored version to make it easier to read:
private List<string> GenerateTerms(string[] docs)
{
    return docs.SelectMany(doc => ProcessDocument(doc)).Distinct().ToList();
}

private IEnumerable<string> ProcessDocument(string doc)
{
    return doc.Split(' ')
              .GroupBy(word => word)
              .OrderByDescending(g => g.Count())
              .Select(g => g.Key)
              .Take(20000);
}
Upvotes: 5