Check list for similar items

Question

Given the following list

simple_list = ['abcX1', 'def', 'abc', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']

How can I delete items, which start the same but do not have a suffix.

Example:

Of 'abcX1', 'abc, 'abcX2', 'abcY' I only want to keep 'abcX1', 'abcX2', 'abcY'
Of 'def', 'defX1' I want to keep 'defX1'
Of 'efg', 'fgh' I want to keep both, as they do not have a counterpart with a suffix

My thoughts:

for i in simple_list:
   for j in simple_list:
      Here I am stuck

Furthermore, my attempt does not seem very pythonic. So suggestions would be highly appreciated.

tobias_k · Accepted Answer

For short lists, you can just use a nested list comprehension with any, checking whether the current element is a prefix of any other element (except itself).

lst = ['abcX1', 'def', 'abc', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']

res = [x for x in lst if not any(x != y and y.startswith(x) for y in lst)]

If the list contains duplicate elements, all the duplicates will be kept; in this case, you should compare indices instead of the elements themselves, or create a set from the elements first.)

Note, however, that this will have quadratic complexity, i.e. O(n²), which may be prohibitive for longer lists. Better (but also a bit more complicated) would be to create a Trie, or prefix tree, from all the words, and then use only the leafs of that tree.

tree = {}
for w in lst:
    t = tree
    for c in w:
        t = t.setdefault(c, {})

def leafs(t, p=""):
    return [x for k in t for x in leafs(t[k], p+k)] if t else [p]

res = list(leafs(tree))

The complexity for creating the tree and getting the leafs should be O(n*k), with n being the number of words and k the avg. number of letters in each word.¹

Another way, a bit simpler: Note that if x is a prefix of y, then x will be sorted right before y, so you can sort the list and then just compare consecutive values. The complexity for this will be O(n logn) for sorting and then O(n) for filtering.

srt = sorted(lst)
res = [x for i, x in enumerate(srt)
         if i == len(srt) - 1 or not srt[i+1].startswith(x)]

Both the tree and sorting-approach may scramble the order of elements in the result. If you want the elements in their original order, you can create a set from the filtered elements and then do another pass over the original list, keeping the elements that are in the set. This will add another O(n):

res_set = set(res)
res = [x for x in lst if x in res_set]

Each way, the result res is ['abcX1', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']. Also, they all will create a new, filtered list, but not modify the original.

I did some more timing analysis. For your list, as expected there is not too much of a difference. Interestingly, the prefix-tree is slowest, probably since the overhead of creating all those dictionaries beats the O(n²) on the short list. Sorting is fastest.

>>> %timeit test.simple(lst)
10000 loops, best of 5: 44.3 µs per loop
>>> %timeit test.prefixtree(lst)
10000 loops, best of 5: 51.6 µs per loop
>>> %timeit test.sorting(lst)
100000 loops, best of 5: 11.2 µs per loop

More importantly, on a list of 1000 words, the prefix-tree -- now much faster than the nested loop -- is still considerably slower than sorting, despite its lower complexity. My explanation for this is that the O(n logn) of the builtin sorted is implemented in fast C whereas the O(n) of the tree is still slow Python.

>>> %timeit test.simple(words)
1 loop, best of 5: 562 ms per loop
>>> %timeit test.prefixtree(words)
100 loops, best of 5: 8.42 ms per loop
>>> %timeit test.sorting(words)
1000 loops, best of 5: 1.23 ms per loop

¹ I added the k factor here as it is more obvious in this algorithm, but of course using startswith is also not O(1) but O([length of actual prefix])

Check list for similar items

Answers (2)

Related Questions