Reputation: 25834

C# Efficient Substring with many inputs

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?

Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.

E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be a subset of the first collection of keywords, all of which had at least 1 positive keyword but 0 negative keywords.

Oh, and please don't mention regular expressions as a suggested solutions.

It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.

Edit:

A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:

Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and 1 1,000,000 character string. Check if the 1,000,000 character string contains at least 1 of the positive keywords.

I suppose one solution is to create an FSA. Another is delimit on spaces and use hashes.

Upvotes: 4

Answers (7)

Ray

Reputation: 3109

To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.

That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.

Upvotes: 0

BlueMonkMN

Reputation: 25601

If you're truly just looking for space-delimited words, this code would be a very simple implementation:

    static void Main(string[] args)
    {
        string sIn = "This is a string that isn't nearly as long as it should be " +
            "but should still serve to prove an algorithm";
        string[] sFor = { "string", "as", "not" };
        Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
    }

    private static string[] FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Intersect(hsFor).ToArray();
    }

If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:

    private static bool FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Overlaps(hsFor);
    }

Upvotes: 1

LBushkin

Reputation: 131676

Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.

As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...

One way to optimize the search may (if you are always searching for "words" - and not phrases that could contains spaces) would be to build a search index of from your string.

The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparions you need to perform.

The general search algorithm is:

For each string you are searching against:

Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords", when one is found, add it to a dictionary as they (you could also track a count of how often the word appears)

Here's an example using a sorted array and binary search in C# 2.0:

NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.

string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords, 
                               string[] negativeKeywords )
{
   Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
   foreach( string searchIn in stringsToSearch )
   {
       // tokenize and sort the input to make searches faster 
       string[] tokenizedList = searchIn.Split( ' ' );
       Array.Sort( tokenizedList );

       // if any negative keywords exist, skip to the next search string...
       foreach( string negKeyword in negativeKeywords )
           if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
               continue; // skip to next search string...

       // for each positive keyword, add to dictionary to keep track of it
       // we could have also used a SortedList, but the dictionary is easier
       foreach( string posKeyword in positiveKeyWords )
           if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
               foundKeywords[posKeyword] = 1; 
   }

   // convert the Keys in the dictionary (our found keywords) to an array...
   string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
   foundKeywords.Keys.CopyTo( foundKeywordArray, 0 );
   return foundKeywordsArray;
}

Here's a version that uses a dictionary-based index and LINQ in C# 3.0:

NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.

public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
    {
        var foundKeywordsDict = new Dictionary<string, int>();
        foreach( var searchIn in searchStrings )
        {
            // tokenize the search string...
            var tokenizedDictionary = searchIn.Split( ' ' ).ToDictionary( x => x );
            // skip if any negative keywords exist...
            if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
                continue;
            // merge found positive keywords into dictionary...
            // an example of where Enumerable.ForEach() would be nice...
            var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey)
            foreach (var keyword in found)
                foundKeywordsDict[keyword] = 1;
        }
        return foundKeywordsDict.Keys;
    }

Upvotes: 2

plinth

Reputation: 49189

Seems to me that the best way to do this is take your match strings (both positive and negative) and compute a hash of them. Then march through your million string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.

So you get something like:

IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i < inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        if (i + b.Length >= inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);

        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}

Upvotes: 0

Nick

Reputation: 3337

First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster then splitting, sorting, and searching.

Upvotes: 0

Reed Copsey

Reputation: 564413

If you add this extension method:

public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}

Then this becomes a one line statement:

var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));

This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.

Upvotes: 1

devuxer

Reputation: 42364

Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.

Upvotes: 0

C# Efficient Substring with many inputs

Answers (7)

Related Questions