Pece

Reputation: 637

Faster algorithm for string comparison in C#

I have two sentences that need to be compared to each other. The final result is what percentage of one sentence is contained in the other. My problem is that I have 100,000 records that need to be compared with, let's say, another 10. That is 1,000,000 comparisons, which with my algorithm is very slow.

This is the algorithm that I am using:

private double BreakStringsAndCheck(string s1, string s2)
{
    if (s1 == null || s2 == null || s1.Length == 0 || s2.Length == 0)
        return (double)0;
    string[] firstArray = s1.Split(' ');
    string[] secondArray = s2.Split(' ');
    if (firstArray.Length > secondArray.Length)
    {
        // make sure firstArray is always the shorter of the two
        string[] tempArray = firstArray;
        firstArray = secondArray;
        secondArray = tempArray;
    }
    double value = 0;
    for (int i = 0; i < firstArray.Length; i++)
        for (int j = 0; j < secondArray.Length; j++)
            value += firstArray[i] == secondArray[j] ? (double)100 : (double)0;
    // findLongest is a class-level flag (not shown in the question)
    return findLongest ? value : value / firstArray.Length;
}

It's a small method, but it is not very fast. From my testing I can do 40-60 comparisons per second, which is almost 5 hours for 1,000,000 comparisons.

Can someone think of another method or logic that is much faster than this?

Update:

I will try to explain the problem in more detail. I have a database with more than 100,000 records, and every day I insert and compare 10-20 new records against this database. These records are sentences of 2 to 10 words, and I need a fast method that compares each new record with those in the database; the result should be the percentage of words from one sentence that are contained in the other.

I need the records that have more than a 70% word match.

I hope that I'm clear now.

Upvotes: 11

Views: 3144

Answers (8)

The Archetypal Paul

Reputation: 41779

If you split the 10 records first, then you're finding a small number of strings in many larger strings. This seems to fit http://en.wikipedia.org/wiki/String_searching_algorithm#Algorithms_using_finite_set_of_patterns

and the Aho-Corasick algorithm might work well for you

How long are the records?

EDIT:

This swap is unnecessary - your comparison is symmetric with respect to firstArray and secondArray:

if (firstArray.Length > secondArray.Length)
{
    string[] tempArray = firstArray;
    firstArray = secondArray;
    secondArray = tempArray;
}

instead, replace the return with

return findLongest ? value : (firstArray.Length > secondArray.Length ? value / secondArray.Length : value / firstArray.Length);

only with something more readable :)

UPDATE after question update

So you could pre-process the 100,000 (e.g. to hash the words)? And only 10-20 change per day so keeping the preprocessed data up to date would be easy.

You definitely need to do something that uses the relatively static nature of the 100,000. Even if you did the pre-processing just once per day, you could do the comparison against all of the previous day's records, then use your current slowish approach for any others added since the last preprocessing run. From what you say, there will be at most 10-20 of those.

I think either the hashing idea, or building an Aho-Corasick trie from the corpus, would give you much faster searching.
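For illustration, a minimal sketch of the pre-hashing idea, assuming the 100,000 rows are held in memory as plain strings and re-hashed once per day (storedRecords, newSentence and i are made-up names, not from the question):

using System.Collections.Generic;
using System.Linq;

// Build once per day: one HashSet of words per stored record.
List<HashSet<string>> hashedRecords = storedRecords
    .Select(r => new HashSet<string>(r.Split(' ')))
    .ToList();

// Comparing a new sentence against record i is then just cheap set lookups.
string[] words = newSentence.Split(' ');
double percent = 100.0 * words.Count(hashedRecords[i].Contains) / words.Length;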

Upvotes: 0

Jonas Elfström

Reputation: 31458

Intersect example

private double BreakStringsAndCheck(string s1, string s2)
{
    var split1 = s1.Split(' ');
    return (double)split1.Intersect(s2.Split(' ')).Count() / split1.Count() * 100.0;
}

I would prefer to return the ratio 0.4 instead of 40.0 for:

var percent = BreakStringsAndCheck("Jan Banan går till GAIS.", "I Torsk på Tallin så var en annan Jan Banan med.");

I just realized that your algorithm always compares the shorter string to the longer one, so it would return 40.0 even if the input parameters are switched, like this:

var percent = BreakStringsAndCheck("I Torsk på Tallin så var en annan Jan Banan med.", "Jan Banan går till GAIS.");

but my Intersect example will return 18.18. I feel that this is more correct, but if you really want it your way then just add

if (s1.Length > s2.Length)
{
    var tmp = s2;
    s2 = s1;
    s1 = tmp;
}

to the beginning of the method.

Presplitting

var presplits = new List<string[]>() { s1.Split(' '), s2.Split(' '), s3.Split(' ') };

...

private static IEnumerable<double> StringsInString(IEnumerable<string[]> strings, string s2)
{
    return strings.Select(h => (double)h.Intersect(s2.Split(' ')).Count() / h.Count());
}

then loop over all your 100,000 stored strings in a Parallel.For.
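That outer loop could be shaped roughly like this sketch, where allRecords stands in for the 100,000 stored strings (allRecords and results are assumptions, not part of the answer above):

using System.Linq;
using System.Threading.Tasks;

var results = new double[allRecords.Count][];
Parallel.For(0, allRecords.Count, i =>
{
    // each stored string is compared against all of the presplit new sentences
    results[i] = StringsInString(presplits, allRecords[i]).ToArray();
});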

PS. I think that you will have to lowercase the strings and remove punctuation such as "." and "," to get a more correct ratio. DS.

Upvotes: 0

Kam

Reputation: 350

As the data is in the database, can you not do the work in the database?

Shred the sentences into words, keyed against the sentence row.

Join your words against the shredded words. This should allow you to see which sentences have a matching word.

If you then group and count by sentence id, you should get, for each stored sentence, the number of its words that match the new sentence.

I would look at shredding your data beforehand and using the shredded words as an index against your main sentence table.
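A rough in-memory LINQ illustration of the same shred/join/group idea (the storedSentences shape with Id and Text, and newSentence, are hypothetical; in practice this would be a query or view in the database itself):

using System.Linq;

// shred: one (Id, Word) pair per word of every stored sentence
var shredded = storedSentences
    .SelectMany(s => s.Text.Split(' '), (s, word) => new { s.Id, Word = word });

var newWords = newSentence.Split(' ');

// join on the word, then group and count per stored sentence
var matchCounts =
    from row in shredded
    join w in newWords on row.Word equals w
    group row by row.Id into g
    select new { SentenceId = g.Key, MatchingWords = g.Count() };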

Upvotes: 0

Binary Worrier

Reputation: 51719

Try this.

Before performing any comparisons, preprocess the 100,000 rows. Every word in the 100,000 rows is going to be a key in a Dictionary<> object, and the value is going to be a list of ids (the ids of each row that the word occurs in), e.g.

Dictionary<string, List<int>> allWords
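A minimal sketch of building that index, assuming the rows are available as id/sentence pairs (rows, Id and Sentence are illustrative names):

using System.Collections.Generic;

var allWords = new Dictionary<string, List<int>>();
foreach (var row in rows)
{
    foreach (var word in row.Sentence.Split(' '))
    {
        List<int> ids;
        if (!allWords.TryGetValue(word, out ids))
            allWords[word] = ids = new List<int>();   // first time this word is seen
        ids.Add(row.Id);
    }
}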

When searching for a match, you keep a second dictionary; this one is keyed by row id, and its value is an integer you'll increment, e.g.

Dictionary<int, int> matches

You split the search string into words, and for each row id listed under each word you increment the count for that row id.

var searchWords = search.Split(' ');
foreach (var word in searchWords)
{
    List<int> ids;
    if (!allWords.TryGetValue(word, out ids))
        continue;                              // word doesn't occur in the corpus
    foreach (var id in ids)
    {
        int count;
        matches.TryGetValue(id, out count);
        matches[id] = count + 1;               // initialise or increment
    }
}
var bestRowId = (from m in matches orderby m.Value select m.Key).Last();

The row id with the greatest value is the best match.

It'll take some time up front to build the dictionary (but I'd guess not much more than for a single comparison), but it will be blindingly fast after that.

NB: The key to the speed here is that Dictionary will use the HashCode of the key it's storing, and the .net hash function for strings is excellent.

Update

If pre-processing on this scale takes too long, you can do a lighter pre-process.
As you read each of the 100,000 rows, split it into words and sort the array of words. Then, as you compare, split the string to compare and sort it as well. Your function then saves time because it isn't splitting each string multiple times, and the nested loops can be replaced with a single merge-style pass over the two sorted arrays.
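A sketch of that single pass, under the assumption that both word arrays were sorted with ordinal comparison (words1 and words2 are illustrative names):

// assumes words1 and words2 were both sorted with ordinal comparison
int i = 0, j = 0, matches = 0;
while (i < words1.Length && j < words2.Length)
{
    int cmp = string.CompareOrdinal(words1[i], words2[j]);
    if (cmp == 0) { matches++; i++; j++; }   // same word in both arrays
    else if (cmp < 0) i++;                   // advance whichever side has the smaller word
    else j++;
}
// divide matches by whichever word count the original method would use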

Upvotes: 0

smirkingman

Reputation: 6368

Here's a different approach. I'm guessing that when you compare 10 sentences against 100,000 sentences, there will be a large number where no words match at all and the percentage is 0. Instead of always performing 100,000 comparisons, find the sentences among the 100,000 where at least one word matches and compare only those.

Create (once) a dictionary of all the words in the 100,000 sentences.

Each entry is a list L of the sentences containing that word.

tobetested = empty
for each s in the 10 sentences
  for each word in s
    if dictionary.contains(word) then
      add the members of L that aren't already there to tobetested
  next
  for each sentence in tobetested    ' hopefully much fewer than 100,000
    compare using your algorithm
  next
next

Upvotes: 0

Samuel

Reputation: 2613

Personally I'd avoid creating the two arrays; the memory allocations will kill performance.

Try looking at the string.IndexOf function to find where the next space is in both strings, and subtract that from the previous space location to work out the word length. If the two lengths are equal, then use string.Compare to see if the two sub-strings are equal. This avoids the memory allocations and only iterates through the strings, so it should be faster.
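This is only a sketch of that idea, assuming words are separated by single spaces; the method name and the ordinal comparison are my own choices, and it still rescans s2 once per word of s1, but it avoids the per-word allocations:

using System;

private static double WordOverlapWithoutSplit(string s1, string s2)
{
    if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
        return 0.0;

    int words = 0, matches = 0;
    for (int start1 = 0; start1 < s1.Length; )
    {
        int end1 = s1.IndexOf(' ', start1);
        if (end1 < 0) end1 = s1.Length;
        int len1 = end1 - start1;
        if (len1 > 0)
        {
            words++;
            // walk s2 looking for a word with the same length and content
            for (int start2 = 0; start2 < s2.Length; )
            {
                int end2 = s2.IndexOf(' ', start2);
                if (end2 < 0) end2 = s2.Length;
                if (end2 - start2 == len1 &&
                    string.Compare(s1, start1, s2, start2, len1, StringComparison.Ordinal) == 0)
                {
                    matches++;
                    break;
                }
                start2 = end2 + 1;
            }
        }
        start1 = end1 + 1;
    }
    return words == 0 ? 0.0 : 100.0 * matches / words;
}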

Also, as others have mentioned, definitely look at using the Parallel extensions.

Upvotes: 2

D.Shawley

Reputation: 59623

I'm not a C# programmer, but here are a few general tips:

  1. Move the floating-point arithmetic out of the loop. You should be able to count the matches as an integer and do the division later (see the sketch after this list).
  2. You should be able to run each "long" loop in a separate thread of execution since the data is static. I would spawn a separate thread for each of your "10" sentences and run them in parallel.
  3. You might want to remove the call to split if you can. Basically, remove any extra memory allocations.
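
A minimal sketch of tips 1 and 2 applied to the question's method, counting matches as an integer and running one parallel task per new sentence (newSentences, storedSentences and the method name are illustrative assumptions):

using System.Linq;
using System.Threading.Tasks;

private static double WordMatchPercent(string[] words1, string[] words2)
{
    if (words1.Length == 0) return 0.0;
    int matches = 0;                            // integer arithmetic only inside the loops
    for (int i = 0; i < words1.Length; i++)
        for (int j = 0; j < words2.Length; j++)
            if (words1[i] == words2[j]) { matches++; break; }  // count each word of words1 at most once
    return 100.0 * matches / words1.Length;     // single division at the end
}

// one parallel task per new sentence, each scanning the whole stored set
var bestScores = newSentences
    .AsParallel()
    .Select(n => storedSentences.Max(s => WordMatchPercent(n.Split(' '), s.Split(' '))))
    .ToArray();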

The final thought is to grab an algorithms book or google for text processing algorithms. This problem sounds like something that has been solved over and over again. There is probably something in AOCP v3 that solves this problem. You could also profile the code (not sure what types of profilers are available), but that probably won't yield substantial improvements.

Upvotes: 6

Ahmad

Reputation: 24947

Have you looked at the Intersect method as an alternative? I have no idea about its performance, but it looks like it may work.

Upvotes: 2
