Reputation: 10159

Find smallest subset prefixes

Here is my code for this problem. I am using Trie tree for this solution, and wondering if any other better ideas in terms of better time complexity or space complexity. Also any bugs and code style advice are appreciated.

Problem:

Given a set of strings, return the smallest subset of the given input word set --- that contains prefixes for every input words in the given input word set. The prefix should be a complete input word in the given input set, other than a prefix of a given word, For a word, which has no prefix, return itself.

If the list is ['foo', 'foog', 'food', 'asdf'] return ['foo', 'asdf']

The return is foo since foo is prefix for foo (itself), prefix for foog and prefix for food (in other words, foo could "represent" longer string like foog and food). Output also contains asdf because it is not prefix for any other words in the input list, so output itself.

The empty set is not a correct answer because it does not contain the longest possible prefixes.

Source code:

from collections import defaultdict
class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.isEnd = False
    def insert(self, word):
        node = self
        for w in word:
            node = node.children[w]
        node.isEnd = True
    def find_prefix(self, prefix, result):
        if self.isEnd:
            result.append(prefix[:])
            return
        for k,v in self.children.items():
            prefix.append(k)
            v.find_prefix(prefix, result)
            prefix.pop(-1)

if __name__ == "__main__":
    words = ['foo', 'foog', 'food', 'asdf']
    root = TrieNode()
    for w in words:
        root.insert(w)
    result = []
    root.find_prefix([], result)
    print result

Upvotes: 0

Answers (2)

2ps

Reputation: 15956

I prefer the simpler while-loop approach with a sort at the beginning:

minimal = []
words = ['foo', 'foog', 'food', 'asdf']
words.sort(key=lambda x: (len(x), x))
while words:
    word = words[0]
    minimal.append(word)
    words = [ x for x in words[1:] if not x.startswith(word) ]
print minimal

It is a fairly efficient implementation running in at worst O(n**2) when no string is a prefix of any other string.

Postscript #1: you can make the sort slightly more efficient by only sorting on the length of the words instead of both length and alphabetically. e.g., changing this line:

    words.sort(key=lambda x: (len(x), x))

to:

    words.sort(key=lambda x: len(x))

Of course a sort is O(n(log n)) which is the lower bound on the running-time/complexity.

Postscript #2:

If you prefer defined memory characteristics you can use marking instead of filtering on the words list. A marking version of this algorithm would look like this:

    words = [ 'foo', 'foog', 'food', 'asdf' ]
    words.sort(key=lambda x: len(x))
    marked = [ False for _ in words ]
    for i in range(0, len(words)):
        is_marked = marked[i]
        if is_marked: continue 
        word = words[i]

        for j in range(i + 1, len(words)):
            if not marked[j] and words[j].startswith(word):
                marked[j] = True
    minimal = [ word for word, is_marked in zip(words, marked) if not is_marked ]

It is slightly more verbose than my preferred filtering version but has the benefit of not continually creating/destroying the words array in each successive pass of the loop.

Upvotes: 1

danh

Reputation: 62676

I think the question is unambiguous. (Maybe it was tougher before the edits). The answer is that trie seems exactly right.

Build a trie from the input words, then traverse it depth first. Each time you find a node (an inner node or a leaf) that's in the input set, add the word at that node to the output and cease searching it's children.

Upvotes: 1

Find smallest subset prefixes

Answers (2)

Related Questions