Michael Low
Michael Low

Reputation: 24506

Generating combinations of substrings from a string

I'm trying to is generate all possible syllable combinations for a given word. The process for identifying what's a syllable isn't relevant here, but it's the generating of all combinations that's giving me a problem. I think this is probably possible to do recursively in a few lines I think (though any other way is fine), but I'm having trouble getting it working. Can anyone help ?

    // how to test a syllable, just for the purpose of this example
    bool IsSyllable(string possibleSyllable) 
    {
        return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$");
    }

    List<string> BreakIntoSyllables(string word)
    {
       // the code here is what I'm trying to write 
       // if 'word' is "misunderstand" , I'd like this to return
       //  => {"mis","und","er","stand"},{ "mis","un","der","stand"}
       // and for any other combinations to be not included
    }

Upvotes: 6

Views: 1305

Answers (3)

xanatos
xanatos

Reputation: 111810

Normally this type of problems is solved using Tries. I will base my implementation of a Trie on How to create a trie in c# (but notice that I have rewritten it).

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" });
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" });

var word = "unquestionable";

var parts = new List<List<string>>();

Split(word, 0, trie, trie.Root, new List<string>(), parts);

//

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts)
{   
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if).
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if)
    if (node.IsTerminal)
    {
        // Add the syllable to the current list of syllables
        currentParts.Add(node.Word);

        // "covered" the word with syllables
        if (index == word.Length)
        {
            // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time.
            parts.Add(new List<string>(currentParts));
        }
        else
        {
            // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root.
            Split(word, index, trie, trie.Root, currentParts, parts);
        }

        // Remove the syllable from the current list of syllables
        currentParts.RemoveAt(currentParts.Count - 1);
    }

    // We have covered all the word with letters. No more work to do in this subiteration
    if (index == word.Length)
    {
        return;
    }

    // Here we try to find the edge corresponding to the current character

    TrieNode nextNode;

    if (!node.Edges.TryGetValue(word[index], out nextNode))
    {
        return;
    }

    Split(word, index + 1, trie, nextNode, currentParts, parts);
}

public class Trie
{
    public readonly TrieNode Root = new TrieNode();

    public Trie()
    {
    }

    public Trie(IEnumerable<string> words)
    {
        this.AddRange(words);
    }

    public void Add(string word)
    {
        var currentNode = this.Root;

        foreach (char ch in word)
        {
            TrieNode nextNode;

            if (!currentNode.Edges.TryGetValue(ch, out nextNode))
            {
                nextNode = new TrieNode();
                currentNode.Edges[ch] = nextNode;
            }

            currentNode = nextNode;
        }

        currentNode.Word = word;
    }

    public void AddRange(IEnumerable<string> words)
    {
        foreach (var word in words)
        {
            this.Add(word);
        }
    }
}

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>();
    public string Word { get; set; }

    public bool IsTerminal
    {
        get
        {
            return this.Word != null;
        }
    }
}

word is the string you are interested in, parts will contain the list of lists of possible syllables (it would probably be more correct to make it a List<string[]>, but it's quite easy to do it. Instead of parts.Add(new List<string>(currentParts)); write parts.Add(currentParts.ToArray()); and change all the List<List<string>> to List<string[]>.

I'll add a variant of Enigmativity response thas is theretically faster than his because it discards wrong syllables immediately instead of post-filtering them later. If you like it, you should give him a +1, because without his idea, this variant wouldn't be possible. But note that it's still an hack. The "right" solution is in using Trie(s) :-)

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$");

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        (
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)
        let e = t.Substring(n)
        let f = splitter(e)
        from g in f
        select (new[] { s }).Concat(g).ToArray()
        )
        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

var parts = splitter(word).ToList();

An explanation:

        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        where isSyllable(s)

We calculate all the possible syllables of a word, from length 1 to the length of the word - 1 and check if it's a syllable. We weed out directly the non-syllables. The full word as a syllable will be checked later.

        let e = t.Substring(n)
        let f = splitter(e)

We search the syllables of the remaining part of the string

        from g in f
        select (new[] { s }).Concat(g).ToArray()

And we chain the found syllables with the "current" syllable. Note that we are creating many useless arrays. If we accept to have an IEnumerable<IEnumerable<string>> as the result, we can take away this ToArray.

(we could rewrite many rows together, deleting many let, like

        from g in splitter(t.Substring(n))
        select (new[] { s }).Concat(g).ToArray()

but we won't do it for clarity)

And we concatenate the "current" syllable with the found syllables.

        .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

Here we could rebuild the query a little so to not use this Concat and create empty arrays, but it would be a little complex (we could rewrite the entire lambda function as a isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction)

In the end, if the whole word is a syllable, we add the whole word as a syllable. Otherwise we Concat an empty array (so no Concat)

Upvotes: 4

Enigmativity
Enigmativity

Reputation: 117010

Try starting with this:

var word = "misunderstand";

Func<string, bool> isSyllable =
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$");

var query =
    from i in Enumerable.Range(0, word.Length)
    from l in Enumerable.Range(1, word.Length - i)
    let part = word.Substring(i, l)
    where isSyllable(part)
    select part;

This returns:

misunderstand-results

Does that help to begin with at least?


EDIT: I thought a bit further about this problem an came up with this couple of queries:

Func<string, IEnumerable<string[]>> splitter = null;
splitter =
    t =>
        from n in Enumerable.Range(1, t.Length - 1)
        let s = t.Substring(0, n)
        let e = t.Substring(n)
        from g in (new [] { new [] { e } }).Concat(splitter(e))
        select (new [] { s }).Concat(g).ToArray();

var query =
    from split in (new [] { new [] { word } }).Concat(splitter(word))
    where split.All(part => isSyllable(part))
    select split;

Now query return this:

misunderstanding-results2

Let me know if that's nailed it now.

Upvotes: 4

Noel Kennedy
Noel Kennedy

Reputation: 12258

You might have problems with scaling this to be honest, I'm not sure how big your dataset is but a solution based on a simple 'is this a a syllable?' you will need to call your 'syllable detection' routine roughly 0(n*n) for each word where n = the number of characters in the word (if this is meaningless, it means that it's likely to be slow for large datasets!). This doesn't take into account the scalability of the detection algorithm which might also slow down as you add more syllables. .

I know you have said your process for identifying what's a syllable or not isn't relevant, but lets say you can change it to have it work more like autocompletion, ie pass in the beginning of a syllable and have it tell you all the syllables that are possible from this point would be much more scalable. Have a look at replacing it with a trie if performance gets out of hand.

Upvotes: 0

Related Questions