Reputation: 51
Given the following list
simple_list = ['abcX1', 'def', 'abc', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']
How can I delete items which start the same as other items but do not have a suffix?
Example:
'abcX1', 'abc', 'abcX2', 'abcY'
I only want to keep 'abcX1', 'abcX2', 'abcY'
'def', 'defX1'
I want to keep 'defX1'
'efg', 'fgh'
I want to keep both, as they do not have a counterpart with a suffix.
My thoughts:
for i in simple_list:
    for j in simple_list:
Here I am stuck
Furthermore, my attempt does not seem very pythonic. So suggestions would be highly appreciated.
Upvotes: 0
Views: 136
Reputation: 82929
For short lists, you can just use a nested list comprehension with any, checking whether the current element is a prefix of any other element (except itself).
lst = ['abcX1', 'def', 'abc', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']
res = [x for x in lst if not any(x != y and y.startswith(x) for y in lst)]
(If the list contains duplicate elements, all the duplicates will be kept; if that is not what you want, compare indices instead of the elements themselves, or create a set from the elements first.)
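For example, a minimal sketch of the set-based variant, assuming duplicates should collapse to a single entry (note that building the set also loses the original order):
unique = set(lst)   # drop duplicate strings before filtering
res = [x for x in unique
       if not any(x != y and y.startswith(x) for y in unique)]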
Note, however, that this will have quadratic complexity, i.e. O(n²), which may be prohibitive for longer lists. Better (but also a bit more complicated) would be to create a trie, or prefix tree, from all the words, and then use only the leaves of that tree.
# Build the prefix tree: each node is a dict mapping a character to its child node.
tree = {}
for w in lst:
    t = tree
    for c in w:
        t = t.setdefault(c, {})

# Collect the words spelled out by the paths to the leaf nodes (nodes with no children).
def leafs(t, p=""):
    return [x for k in t for x in leafs(t[k], p+k)] if t else [p]

res = list(leafs(tree))
The complexity for creating the tree and getting the leaves should be O(n*k), with n being the number of words and k the average number of letters in each word.1
Another way, a bit simpler: note that if x is a prefix of y, then x will be sorted right before y, so you can sort the list and then just compare consecutive values. The complexity for this will be O(n log n) for sorting and then O(n) for filtering.
srt = sorted(lst)
res = [x for i, x in enumerate(srt)
       if i == len(srt) - 1 or not srt[i+1].startswith(x)]
Both the tree approach and the sorting approach may scramble the order of elements in the result. If you want the elements in their original order, you can create a set from the filtered elements and then do another pass over the original list, keeping the elements that are in the set. This adds another O(n):
res_set = set(res)
res = [x for x in lst if x in res_set]
In each case, the result res is ['abcX1', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']. Also, all of them create a new, filtered list and do not modify the original.
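If you do want to change the original list in place instead (closer to the "delete items" phrasing of the question), a minimal sketch using slice assignment:
# Overwrite the contents of the original list with the filtered result,
# so that all existing references to the list see the change.
lst[:] = res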
I did some more timing analysis. For your list, as expected, there is not much of a difference. Interestingly, the prefix tree is slowest, probably because the overhead of creating all those dictionaries outweighs the O(n²) of the nested loop on such a short list. Sorting is fastest.
>>> %timeit test.simple(lst)
10000 loops, best of 5: 44.3 µs per loop
>>> %timeit test.prefixtree(lst)
10000 loops, best of 5: 51.6 µs per loop
>>> %timeit test.sorting(lst)
100000 loops, best of 5: 11.2 µs per loop
More importantly, on a list of 1000 words, the prefix tree (now much faster than the nested loop) is still considerably slower than sorting, despite its lower complexity. My explanation for this is that the O(n log n) of the built-in sorted is implemented in fast C, whereas the O(n) of the tree is still slow Python.
>>> %timeit test.simple(words)
1 loop, best of 5: 562 ms per loop
>>> %timeit test.prefixtree(words)
100 loops, best of 5: 8.42 ms per loop
>>> %timeit test.sorting(words)
1000 loops, best of 5: 1.23 ms per loop
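For reference, the test functions used in the timings above are not shown; a sketch of what they might look like, with the assumed names simple, prefixtree and sorting matching the %timeit calls:
def simple(lst):
    # O(n^2): nested comprehension with any()
    return [x for x in lst if not any(x != y and y.startswith(x) for y in lst)]

def prefixtree(lst):
    # O(n*k): build a character trie and keep only the leaf words
    tree = {}
    for w in lst:
        t = tree
        for c in w:
            t = t.setdefault(c, {})
    def leafs(t, p=""):
        return [x for k in t for x in leafs(t[k], p+k)] if t else [p]
    return leafs(tree)

def sorting(lst):
    # O(n log n): sort, then drop elements that are a prefix of their successor
    srt = sorted(lst)
    return [x for i, x in enumerate(srt)
            if i == len(srt) - 1 or not srt[i+1].startswith(x)]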
1 I added the k factor here as it is more obvious in this algorithm, but of course using startswith is also not O(1) but O(length of the actual prefix).
Upvotes: 5
Reputation: 6206
Try this:
filtered_list = [x for x in simple_list if not any([y.startswith(x) and y != x for y in simple_list])]
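For example, with the list from the question (the inner square brackets could also be dropped so that any() works on a generator and can short-circuit):
simple_list = ['abcX1', 'def', 'abc', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']
filtered_list = [x for x in simple_list
                 if not any([y.startswith(x) and y != x for y in simple_list])]
print(filtered_list)  # ['abcX1', 'abcX2', 'abcY', 'defX1', 'efg', 'fgh']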
Upvotes: 1