Reputation: 637
I have two sentences that need to be compared to each other. The final result is the percentage of words from one sentence that are contained in the other. My problem is that I have 100,000 records that need to be compared with, say, another 10. That is 1,000,000 comparisons, which with my algorithm is very slow.
This is the algorithm that I am using:
private double BreakStringsAndCheck(string s1, string s2)
{
    if (s1 == null || s2 == null || s1.Length == 0 || s2.Length == 0)
        return (double)0;

    string[] firstArray = s1.Split(' ');
    string[] secondArray = s2.Split(' ');

    if (firstArray.Length > secondArray.Length)
    {
        string[] tempArray = firstArray;
        firstArray = secondArray;
        secondArray = tempArray;
    }

    double value = 0;
    for (int i = 0; i < firstArray.Length; i++)
        for (int j = 0; j < secondArray.Length; j++)
            value += firstArray[i] == secondArray[j] ? (double)100 : (double)0;

    // findLongest is a bool field defined elsewhere in the class
    return findLongest ? value : value / firstArray.Length;
}
It's a small method, but it is not very fast. From my testing I can do 40-60 comparisons per second, which is almost 5 hours for 1,000,000 comparisons.
Can someone think of another method or logic that is much faster than this?
Update:
I will try to explain the problem in more detail. I have a database with more than 100,000 records, and every day I insert and compare 10-20 new records against this database. These records are sentences of 2 to 10 words, and I need a fast method that compares these new records with those in the database; the result should be the percentage of words from one sentence that are contained in the other.
I need the records that have more than a 70% word match.
I hope that I'm clear now.
Upvotes: 11
Views: 3144
Reputation: 41779
If you split the 10 records first, then you're finding a small number of strings in many larger strings. This seems to fit http://en.wikipedia.org/wiki/String_searching_algorithm#Algorithms_using_finite_set_of_patterns
and the Aho-Corasick algorithm might work well for you
How long are the records?
EDIT:
This swap is unnecessary - your comparison is symmetric with respect to firstArray and secondArray:
if (firstArray.Length > secondArray.Length)
{
    string[] tempArray = firstArray;
    firstArray = secondArray;
    secondArray = tempArray;
}
instead, replace the return with
return findLongest ? value : (firstArray.Length > secondArray.Length ? value / secondArray.Length : value / firstArray.Length);
only with something more readable :)
UPDATE after question update
So you could pre-process the 100,000 (e.g. to hash the words)? And only 10-20 change per day so keeping the preprocessed data up to date would be easy.
You definitely need to do something that exploits the relatively static nature of the 100,000 records. Even if you did the pre-processing just once per day, you could compare against all of the previous day's records, then use your current slower approach for anything added since the last preprocessing run. From what you say, there will be at most 10-20 of those.
I think either the hashing idea or building an Aho-Corasick trie from the corpus would give you much faster searching.
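A minimal sketch of the hashing idea, assuming the existing records are available as a list of strings (storedRecords, newRecord and PercentContained are illustrative names, not from the question):
// Requires: using System; using System.Collections.Generic; using System.Linq;
// One-off (or once-a-day) pre-processing: split each stored record once and
// keep its words in a HashSet for O(1) lookups.
List<HashSet<string>> preprocessed =
    storedRecords.Select(r => new HashSet<string>(r.Split(' '))).ToList();

// Compare one new record against one pre-processed record.
double PercentContained(string newRecord, HashSet<string> storedWords)
{
    string[] newWords = newRecord.Split(' ');
    int hits = newWords.Count(storedWords.Contains);   // one set lookup per word
    return newWords.Length == 0 ? 0.0 : 100.0 * hits / newWords.Length;
}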
Upvotes: 0
Reputation: 31458
Intersect example
private double BreakStringsAndCheck(string s1, string s2)
{
    var split1 = s1.Split(' ');
    return (double)split1.Intersect(s2.Split(' ')).Count() / split1.Count() * 100.0;
}
I would prefer to return the ratio 0.4 instead of 40.0 for:
var percent = BreakStringsAndCheck("Jan Banan går till GAIS.", "I Torsk på Tallin så var en annan Jan Banan med.");
I just realized that your algorithm always compares the shorter string to the longer. So your algorithm would return 40.0 even if the input parameters are switched like this
var percent = BreakStringsAndCheck("I Torsk på Tallin så var en annan Jan Banan med.", "Jan Banan går till GAIS.");
but my intersect example will return 18.18. I feel that this is more correct but if you really want your way then just add
if (s1.Length > s2.Length)
{
    var tmp = s2;
    s2 = s1;
    s1 = tmp;
}
to the beginning of the method.
Presplitting
var presplits = new List<string[]>() { s1.Split(' '), s2.Split(' '), s3.Split(' ') };
...
private static IEnumerable<double> StringsInString(IEnumerable<string[]> strings, string s2)
{
    return strings.Select(h => (double)h.Intersect(s2.Split(' ')).Count() / h.Count());
}
then loop over all your 100,000 strings in a Parallel.For.
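For example (a sketch only; presplits here is the pre-split List<string[]> from above, and newSentence is the incoming record):
// Requires: using System.Linq; using System.Threading.Tasks;
string[] newWords = newSentence.Split(' ');         // split the new record once
var ratios = new double[presplits.Count];

Parallel.For(0, presplits.Count, i =>
{
    ratios[i] = (double)presplits[i].Intersect(newWords).Count() / presplits[i].Length;
});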
PS. I think that you will have to lowercase the strings and remove punctuation such as "." and "," to get a more correct ratio. DS.
Upvotes: 0
Reputation: 350
As the data is in the database, can you not do the work in the database?
Shred the sentences into words, keyed by the sentence row.
Join your words against the shredded words. This should allow you to see which sentences have a matching word.
If you then group and sum them by the sentence id you should get the sum of words that match in the specified sentence against stored sentences.
I would look at shredding your data beforehand and using the words as an index against your main sentence table.
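To show the shape of that query, here is the same shred/join/group logic sketched in LINQ-to-objects; in practice this would be a word table and a JOIN in the database itself, and stored and newSentence are assumed names:
// Requires: using System.Linq;
// stored: IEnumerable<(int Id, string Sentence)> - the existing rows.
var shredded = stored.SelectMany(
    r => r.Sentence.Split(' '),
    (r, word) => new { r.Id, Word = word });

string[] newWords = newSentence.Split(' ');

var matchCounts =
    from s in shredded
    join w in newWords on s.Word equals w
    group s by s.Id into g
    select new { SentenceId = g.Key, MatchingWords = g.Count() };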
Upvotes: 0
Reputation: 51719
Try this.
Before performing any comparisons, preprocess the 100,000 rows.
Every word in the 100,000 rows is going to be a key in a Dictionary<> object; the value is going to be a list of ids (the ids of each row the word occurs in), e.g.
Dictionary<string, List<int>> allWords
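Building that index might look roughly like this (rows is an assumed name for the 100,000 existing records as id/sentence pairs):
var allWords = new Dictionary<string, List<int>>();
foreach (var row in rows)                            // rows: IEnumerable<(int Id, string Sentence)>
{
    foreach (var word in row.Sentence.Split(' '))
    {
        if (!allWords.TryGetValue(word, out var ids))
            allWords[word] = ids = new List<int>();
        ids.Add(row.Id);                             // note: a word repeated in one row adds its id twice
    }
}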
When "searching for a match", you keep a second dictionary, this one is keyed by row id, and it's value is an integer you'll increment. e.g.
Dictionary<int, int> matches
You split the search string into words, and for each word, you increment the value for every row id listed under that word.
var searchWords = search.Split(' ');
foreach (var word in searchWords)
{
    if (!allWords.TryGetValue(word, out var rowIds))
        continue;                                    // word never seen in the 100,000 rows
    foreach (var id in rowIds)
        matches[id] = matches.TryGetValue(id, out var count) ? count + 1 : 1;
}
// requires using System.Linq;
var bestRowId = (from m in matches orderby m.Value select m.Key).Last();
The row id with the greatest value is the best match.
It'll take some time up front to build the dictionary (but I'd guess not much more than for a single comparison), but it will be blindingly fast after that.
NB: The key to the speed here is that Dictionary uses the hash code of the keys it stores, and the .NET hash function for strings is excellent.
Update
If pre-processing on this order takes too long, then you can do a lighter pre-process.
As you read each of the 100,000 rows, split it into words, and sort the array of words. Then as you compare, split the string to compare and sort it also.
Your function then saves time because it isn't splitting each string multiple times, and your nested loops can be replaced with a loop for min(words1.length, words2.length).
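A sketch of that single pass, assuming both word arrays were sorted with an ordinal comparison (e.g. Array.Sort(words, StringComparer.Ordinal)); note the merge-style walk is bounded by the combined length of the two arrays:
static int CountCommonWords(string[] sortedA, string[] sortedB)
{
    int i = 0, j = 0, common = 0;
    while (i < sortedA.Length && j < sortedB.Length)
    {
        int cmp = string.CompareOrdinal(sortedA[i], sortedB[j]);
        if (cmp == 0) { common++; i++; j++; }        // same word in both arrays
        else if (cmp < 0) i++;                       // advance the smaller side
        else j++;
    }
    return common;
}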
Upvotes: 0
Reputation: 6368
Here's a different approach. I'm guessing that when you compare 10 sentences against 100'000 sentences, there will be a large number where no words match and the percentage is 0. Instead of always performing 100'000 comparisons, find those sentences among the 100'000 where at least one word matches and only compare those.
Create (once) a dictionary of all the words in the 100'000 sentences.
Each entry is a list L of sentences containing this word.
tobetested = empty
For each s in the 10 sentences
    for each word in s
        if dictionary.contains(word) then
            add members of L that aren't already there to tobetested
    next
    for each sentence in tobetested   ' hopefully much less than 100'000
        compare using your algorithm
    next
next
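A hedged C# translation of that pseudocode; wordToSentences is the dictionary described above, allSentences maps id to sentence text (both assumed names), and BreakStringsAndCheck is the method from the question:
foreach (var s in newSentences)                      // the 10-20 new sentences
{
    var toBeTested = new HashSet<int>();
    foreach (var word in s.Split(' '))
        if (wordToSentences.TryGetValue(word, out var ids))
            toBeTested.UnionWith(ids);               // add members of L not already there

    foreach (var id in toBeTested)                   // hopefully much less than 100'000
    {
        double percent = BreakStringsAndCheck(s, allSentences[id]);
        // keep id if percent > 70
    }
}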
Upvotes: 0
Reputation: 2613
Personally I'd avoid creating the two arrays; the memory allocations will kill performance.
Try looking at the string.IndexOf function to find where the next space is in both strings, subtract that from the previous space location to work out the word length. If the two lengths are equal then use string.Compare to see if the two sub-strings are equal. This will avoid memory allocations and only iterate through the strings once, so should be faster.
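A rough sketch of that idea, as I read the suggestion (not the answer's own code): walk s1 word by word with IndexOf and, for each word, scan s2 the same way, comparing sub-strings in place.
private static double MatchPercent(string s1, string s2)
{
    if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
        return 0.0;
    int words1 = 0, hits = 0;
    int start1 = 0;
    while (start1 < s1.Length)
    {
        int end1 = s1.IndexOf(' ', start1);
        if (end1 < 0) end1 = s1.Length;
        int len1 = end1 - start1;
        if (len1 > 0)                                // skip empty tokens from double spaces
        {
            words1++;
            int start2 = 0;
            while (start2 < s2.Length)               // scan s2 for a word of equal length and content
            {
                int end2 = s2.IndexOf(' ', start2);
                if (end2 < 0) end2 = s2.Length;
                if (end2 - start2 == len1 &&
                    string.CompareOrdinal(s1, start1, s2, start2, len1) == 0)
                {
                    hits++;
                    break;
                }
                start2 = end2 + 1;
            }
        }
        start1 = end1 + 1;
    }
    return words1 == 0 ? 0.0 : 100.0 * hits / words1;
}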
Also, as others have mentioned, definitely look at using the Parallel extensions.
Upvotes: 2
Reputation: 59623
I'm not a C# programmer, but here are a few general tips:
Avoid calling split if you can. Basically, remove any extra memory allocations.
The final thought is to grab an algorithms book or google for text processing algorithms. This problem sounds like something that has been solved over and over again. There is probably something in AOCP (The Art of Computer Programming) vol. 3 that solves this problem. You could also profile the code (not sure what types of profilers are available), but that probably won't yield substantial improvements.
Upvotes: 6