Reputation: 105
I want to lookup and compare efficiently the string elements in a list and then remove those which are parts of other string elements in the list (with the same beginning point)
list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]
I intend to get a list which looks like this:
list2 = [ 'green apples are worse' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]
In other words, I want to keep the longest string element from those elements which start with the same first characters.
Upvotes: 5
Views: 987
Reputation: 1011
There's a slight ambiguity in the way you worded the question about how you'd want to handle ['a','ab','ac','add']
. I'm assuming you'd want ['ab','ac','add']
.
The below additionally assumes that you don't have any empty strings. That's not a good assumption.
Basically, we're building a tree out of the input values, and only keeping the leaf nodes. This is probably the most complex way to do it. I think it has the potential to be the most efficient, but I'm not certain and that's not what you asked for anyway.
from collections import defaultdict
from itertools import groupby
from typing import Collection, Dict, Generator, Iterable, List, Union
# Exploded is a recursive data type representing a culled list of strings as a tree of character-by-character common prefixes. The leaves are the non-common suffixes.
Exploded = Dict[str, Union["Exploded", str]]
def explode(subject:Iterable[str])->Exploded:
heads_to_tails = defaultdict(list)
for s in subject:
if s:
heads_to_tails[s[0]].append(s[1:])
return {
head: prune_or_follow(tails)
for (head, tails)
in heads_to_tails.items()
}
def prune_or_follow(tails: List[str]) -> Union[Exploded, str]:
if 1 < len(tails):
return explode(tails)
else: #we just assume it's not empty.
return tails[0]
def implode(tree: Exploded, prefix :Iterable[str] = ()) -> Generator[str, None, None]:
for (head, continued) in tree.items():
if isinstance(continued, str):
yield ''.join((*prefix, head, continued))
else:
yield from implode(continued, (*prefix, head))
def cull(subject: Iterable[str]) -> Collection[str]:
return list(implode(explode(subject)))
print(cull(['a','ab','ac','add']))
print(cull([ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' , ' this is another sentence ' , 'a boy ran towards the mill and fell']))
print(cull(['a', 'ab', 'ac', 'b', 'add']))
EDIT:
I flattened some of the calls, I hope it's easier to read and reason about this way.
It's bugging me that I can't figure out the run-time complexity of this process. I think it's O(nm) where m is the length of the overlapping prefixes, compared to O(nm log(n)) for a string comparison...
EDIT:
I started this other question in Code Review, hoping someone could help me figure out the complexity a little. Someone there pointed out that the code as written didn't actually work: groupby
is a junk by any sane interpretation of its name. I've swapped out the above code, and it's a little easier to read this way as well.
EDIT:
Ok, I've imported some of the excellent suggestions for CR. At this point I'm pretty sure I'm right about the runtime complexity being better than a sort-based option.
Upvotes: 3
Reputation: 82899
As suggested by John Coleman in comments, you can first sort the sentences and then compare consecutive sentences. If one sentences is a prefix of another, it will appear right before that sentences in the sorted list, so we just have to compare consecutive sentences. To preserve the original order, you can use a set
for quickly looking up the filtered elements.
list1 = ['a boy ran', 'green apples are worse',
'a boy ran towards the mill', ' this is another sentence ',
'a boy ran towards the mill and fell']
srtd = sorted(list1)
filtered = set(list1)
for a, b in zip(srtd, srtd[1:]):
if b.startswith(a):
filtered.remove(a)
list2 = [x for x in list1 if x in filtered]
Afterwards, list2
is the following:
['green apples are worse',
' this is another sentence ',
'a boy ran towards the mill and fell']
With O(nlogn) this is considerably faster than comparing all pairs of sentences in O(n²), but if the list is not too long, the much simpler solution by Vicrobot will work just as well.
Upvotes: 3
Reputation: 3988
This is a way you can achieve that:-
list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' , ' this is another sentence ' , 'a boy ran towards the mill and fell']
list2 = []
for i in list1:
bool = True
for j in list1:
if id(i) != id(j) and j.startswith(i): bool = False
if bool: list2.append(i)
>>> list2
['green apples are worse', ' this is another sentence ', 'a boy ran towards the mill and fell']
Upvotes: 3