deadfish
deadfish

Reputation: 12304

What algorithms are used for seaching subwords in words?

I know, maybe this question is stupid but I am stuck and I need real help. I need in my project use algorithm that could find all words what starts by another and return all words' tails. For example: Find all words that start by: dad

In dictionary we have:

dada, dadaism, daddled, daddling

Result:

a, aism, dled, dling

I have dictionary with all words, so all what I need is only algorithm. Someone suggested me to use patricia algorithm but I couln't find any sample for C#. My dictionary is very big so I need find also very fast algorithm.


More information:

  • Dictionary is sorted.
  • Upvotes: 1

    Views: 1841

    Answers (6)

    bragboy
    bragboy

    Reputation: 35542

    You can use a TRIE for this. You can find a comprehensive implementation and a tutorial here.

    Basically, in this structure, you will end up traversing from Root to 'd' then to 'a' then to 'd'. You will reach to a point where you want all the words that are starting with 'dad'. Considering this as the root node now, all you have to do is explore all the possible paths underneath and there goes your algorithm

    Upvotes: 0

    Saeed Amiri
    Saeed Amiri

    Reputation: 22555

    The most famous one that I know is "knuth morris pratt string matching algorithm".

    If you take a look at the link, there are some others like Boyer–Moore string search algorithm, ... These are general algorithms, but if you are interested in special cases like start by, ... in most cases languages has this cases, for example in C# you can use StartsWith, EndsWith, there is no need to implement them again.

    Upvotes: 1

    user1088520
    user1088520

    Reputation:

    For loops may be quite fast for very small dictionaries. But if you have matching sets from thousands of words it will be very slow. Assuming that your dictionary is sorted (if not sort it) you can use the BinarySearch function to locate both the first and the last range items and then go with a for loop to create your results.

    To be more practical, I have a (sorted) dictionary with 354984 words including theese 35 words starting from dad: dad, dad's, dada, dadaism, dadaisms, dadaist, dadaistic, dadaistically, dadaists, dadap, dadas, dadburned, dadder, daddies, dadding, daddle, daddled, daddles, daddling, daddock, daddocky, daddums, daddy, daddynut, dade, dadenhudd, dading, dado, dadoed, dadoes, dadoing, dados, dadouchos, dads and daduchus. If I follow Jim's approach, I will have to perform 35 "StartsWith" which is OK. In case of "sat" prefix I have 228 words and in case of "cat" prefix I have 692 words. For the size of my dictionary I need a total of 40 string comparisons (worst case) to locate the first and the last items.

    If you are willing to use any trie implementation, be sure that supports at least numbers and dashes if your dictionary includes records like 1st or real-time.

    Upvotes: 1

    Jim Mischel
    Jim Mischel

    Reputation: 134005

    How you make this work will depend on how your dictionary is arranged. If it's a sorted list of words, then you can use binary search to find the first word that starts with "dad", and then loop through just those using StartsWith and Substring. That is:

    List<string> Words = LoadWords(); // however you load them
    Words.Sort();
    
    // Now, search for "dad" (or whatever)
    string prefix = "dad";
    
    int index = Words.BinarySearch(prefix);
    
    // If the returned index is negative, the word wasn't found.
    // The index is the one's compliment of the the place where it would be in the list.
    if (index < 0)
    {
        index = ~index;
    }
    
    for (int i = index; i < Count && Words[i].StartsWith(prefix))
    {
        Console.WriteLine(Words[i].Substring(prefix.Length));
    }
    

    This should be very fast. The sort is a one-time cost after loading. And you can eliminate it altogether if you store the dictionary in sorted order. The binary search is O(log n), where n is the number of words in the dictionary.

    If your dictionary is unordered, then you'll have to go through all the words, which is going to take a lot of time.

    There are other organizations for your dictionary, that will make it take a lot less space and that could potentially be faster. Those are somewhat more complicated and take a lot more time to build than creating a sorted list.

    Upvotes: 3

    wolfovercats
    wolfovercats

    Reputation: 1

    if you need something simple, you can try this:

            string[] dict = new string[] { "dada", "dadaism", "daddled", "daddling" };
            string prefix = "dad";
    
            var words = from d in dict
                        where d.StartsWith(prefix)
                        select d.Substring(prefix.Length);
    

    Upvotes: 0

    spender
    spender

    Reputation: 120480

    This sounds like a perfect use for a Trie/DAWG (directed acyclic word graph). I understand that Patricia tree is a trie-like variation. There's a nice article about Tries and an implementation here.

    Upvotes: 3

    Related Questions