Reputation: 13367
I have a function that works very slowly for my task (it needs to be 10-100 times faster). Here is the code:
public long Support(List<string[]> sequences, string[] words)
{
    var count = 0;
    foreach (var sequence in sequences)
    {
        for (int i = 0; i < sequence.Length - words.Length + 1; i++)
        {
            bool foundSeq = true;
            for (int j = 0; j < words.Length; j++)
            {
                foundSeq = foundSeq && sequence[i + j] == words[j];
            }
            if (foundSeq)
            {
                count++;
                break;
            }
        }
    }
    return count;
}

public void Support(List<string[]> sequences, List<SequenceInfo> sequenceInfoCollection)
{
    System.Threading.Tasks.Parallel.ForEach(sequenceInfoCollection.Where(x => x.Support == null), sequenceInfo =>
    {
        sequenceInfo.Support = Support(sequences, sequenceInfo.Sequence);
    });
}
Here, List<string[]> sequences is an array of arrays of words. It usually contains 250k+ rows, each about 4-7 words long. string[] words is an array of words (every word occurs in sequences at least once) whose occurrences we are trying to count.
The problem is foundSeq = foundSeq && sequence[i + j] == words[j];. That line takes most of the execution time (Enumerable.MoveNext is in second place). I want to hash all the words in my arrays. Numbers compare faster than strings, right? I think that could gain me 30%-80% in performance, but I need 10x! What can I do? In case you want to know, this is part of the Apriori algorithm.
The Support function checks whether the word sequence words occurs in any sequence of the sequences list and counts how many sequences contain it.
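Here is roughly what I mean by hashing the words (just a sketch, not measured; BuildWordIndex and Encode are placeholder names I made up, and it needs System.Collections.Generic and System.Linq):
    // Sketch only: map every distinct word to an int once, then compare ints
    // instead of strings in the inner loop.
    Dictionary<string, int> BuildWordIndex(IEnumerable<string[]> sequences)
    {
        var index = new Dictionary<string, int>();
        foreach (var word in sequences.SelectMany(s => s))
            if (!index.ContainsKey(word))
                index[word] = index.Count;   // next free id
        return index;
    }

    int[] Encode(string[] words, Dictionary<string, int> index)
    {
        // Unknown words get -1 so they can never match an encoded sequence.
        return words.Select(w => index.TryGetValue(w, out var id) ? id : -1).ToArray();
    }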
Upvotes: 1
Views: 246
Reputation: 77454
Knuth–Morris–Pratt algorithm
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.
From Wikipedia: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
This is one of the improvements you should make, with a small difference in terminology: a "word" in your code plays the role of a "character" in the algorithm, and your words array is what KMP calls the "word" (the pattern).
The idea is that when you search for "abc def ghi jkl", and have matched "abc def ghi" already, but the next word does not match, you can jump three positions.
Search:  abc def ghi jkl
Text:    abc def ghi klm abc def ghi jkl
i=0:     abc def ghi jkl?
skip 2:      XXX XXX        <--- you save two iterations here, i += 2
i=2:             abc?
i=3:                 abc? ...
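A rough sketch of how that could look over word arrays (untested; BuildFailureTable and ContainsPattern are names I made up, and strings play the role of characters here):
    static int[] BuildFailureTable(string[] pattern)
    {
        // fail[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of it.
        var fail = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
            if (pattern[i] == pattern[k]) k++;
            fail[i] = k;
        }
        return fail;
    }

    static bool ContainsPattern(string[] sequence, string[] pattern, int[] fail)
    {
        int k = 0;                                  // words of the pattern matched so far
        for (int i = 0; i < sequence.Length; i++)
        {
            while (k > 0 && sequence[i] != pattern[k]) k = fail[k - 1];
            if (sequence[i] == pattern[k]) k++;
            if (k == pattern.Length) return true;   // whole pattern found
        }
        return false;
    }
The failure table is built once per pattern, so building it inside the parallel loop (once per sequenceInfo) should be cheap compared to scanning 250k+ sequences.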
Upvotes: 2
Reputation: 42443
Theoretically, you could concatenate each sequence and use substring matching. I don't have a compiler at hand now, so I can't test whether it will actually improve performance, but this is the general idea:
List<string> sentences = sequences.Select(seq => String.Join(" ", seq)).ToList();
string toMatch = String.Join(" ", words);
return sentences.Count(sentence => sentence.Contains(toMatch));
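If it does help, one way to slot it into the original signature might be (untested; the joined sentences list would be built once and passed in, rather than rebuilt for every pattern):
    // Untested sketch: 'sentences' is the pre-joined list built once from 'sequences'.
    long Support(List<string> sentences, string[] words)
    {
        string toMatch = String.Join(" ", words);
        return sentences.Count(sentence => sentence.Contains(toMatch));
    }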
Upvotes: 0
Reputation: 34200
The first optimisation I would make is an early fail. Your inner loop keeps comparing the remaining words even when you already know the match has failed, and you're doing an unnecessary bit of boolean logic. Your code is this:
for (int j = 0; j < words.Length; j++)
{
    foundSeq = foundSeq && sequence[i + j] == words[j];
}
Instead, just do this (or equivalent):
for (int j = 0; j < words.Length; j++)
{
    if (sequence[i + j] != words[j])
    {
        foundSeq = false;
        break;
    }
}
This will save you the majority of your comparisons (you will drop out at the first word if it doesn't match, instead of continuing to compare when you know the result is false). It could even make the tenfold difference you're looking for if you expect the occurrence of the individual words in each sequence to be low (if, say, you are finding sentences in a page of English text).
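Folded back into the original method it might look like this (untested sketch, behaviour otherwise unchanged):
    public long Support(List<string[]> sequences, string[] words)
    {
        var count = 0;
        foreach (var sequence in sequences)
        {
            for (int i = 0; i < sequence.Length - words.Length + 1; i++)
            {
                bool foundSeq = true;
                for (int j = 0; j < words.Length; j++)
                {
                    if (sequence[i + j] != words[j])
                    {
                        foundSeq = false;
                        break;       // stop at the first mismatching word
                    }
                }
                if (foundSeq)
                {
                    count++;
                    break;           // this sequence is counted once; move to the next
                }
            }
        }
        return count;
    }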
Upvotes: 0