Reputation: 105

Remove string element in a list of strings if the first characters match with another string element in the list

I want to lookup and compare efficiently the string elements in a list and then remove those which are parts of other string elements in the list (with the same beginning point)

list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' ,  ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

I intend to get a list which looks like this:

list2 = [  'green apples are worse' , ' this is another sentence ' , 'a boy ran towards the mill and fell',.....]

In other words, I want to keep the longest string element from those elements which start with the same first characters.

Upvotes: 5

Answers (3)

ShapeOfMatter

Reputation: 1011

There's a slight ambiguity in the way you worded the question about how you'd want to handle ['a','ab','ac','add']. I'm assuming you'd want ['ab','ac','add'].

The below additionally assumes that you don't have any empty strings. That's not a good assumption.

Basically, we're building a tree out of the input values, and only keeping the leaf nodes. This is probably the most complex way to do it. I think it has the potential to be the most efficient, but I'm not certain ~~and that's not what you asked for anyway~~.

from collections import defaultdict
from itertools import groupby
from typing import Collection, Dict, Generator, Iterable, List, Union

# Exploded is a recursive data type representing a culled list of strings as a tree of character-by-character common prefixes. The leaves are the non-common suffixes.
Exploded = Dict[str, Union["Exploded", str]]

def explode(subject:Iterable[str])->Exploded:
    heads_to_tails = defaultdict(list)
    for s in subject:
        if s:
            heads_to_tails[s[0]].append(s[1:])
    return {
        head: prune_or_follow(tails)
        for (head, tails)
        in heads_to_tails.items()
    }

def prune_or_follow(tails: List[str]) -> Union[Exploded, str]:
    if 1 < len(tails):
        return explode(tails)
    else: #we just assume it's not empty.
        return tails[0]

def implode(tree: Exploded, prefix :Iterable[str] = ()) -> Generator[str, None, None]:
    for (head, continued) in tree.items():
        if isinstance(continued, str):
            yield ''.join((*prefix, head, continued))
        else:
            yield from implode(continued, (*prefix, head))

def cull(subject: Iterable[str]) -> Collection[str]:
    return list(implode(explode(subject)))

print(cull(['a','ab','ac','add']))
print(cull([ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' ,  ' this is another sentence ' , 'a boy ran towards the mill and fell']))
print(cull(['a', 'ab', 'ac', 'b', 'add']))

EDIT:
I flattened some of the calls, I hope it's easier to read and reason about this way. It's bugging me that I can't figure out the run-time complexity of this process. I think it's O(nm) where m is the length of the overlapping prefixes, compared to O(nm log(n)) for a string comparison...

EDIT:
I started this other question in Code Review, hoping someone could help me figure out the complexity a little. Someone there pointed out that the code as written didn't actually work: groupby is a junk by any sane interpretation of its name. I've swapped out the above code, and it's a little easier to read this way as well.

EDIT:
Ok, I've imported some of the excellent suggestions for CR. At this point I'm pretty sure I'm right about the runtime complexity being better than a sort-based option.

Upvotes: 3

tobias_k

Reputation: 82899

As suggested by John Coleman in comments, you can first sort the sentences and then compare consecutive sentences. If one sentences is a prefix of another, it will appear right before that sentences in the sorted list, so we just have to compare consecutive sentences. To preserve the original order, you can use a set for quickly looking up the filtered elements.

list1 = ['a boy ran', 'green apples are worse', 
         'a boy ran towards the mill', ' this is another sentence ',
         'a boy ran towards the mill and fell']                                                                

srtd = sorted(list1)
filtered = set(list1)
for a, b in zip(srtd, srtd[1:]):
    if b.startswith(a):
        filtered.remove(a)

list2 = [x for x in list1 if x in filtered]

Afterwards, list2 is the following:

['green apples are worse',
 ' this is another sentence ',
 'a boy ran towards the mill and fell']

With O(nlogn) this is considerably faster than comparing all pairs of sentences in O(n²), but if the list is not too long, the much simpler solution by Vicrobot will work just as well.

Upvotes: 3

Vicrobot

Reputation: 3988

This is a way you can achieve that:-

list1 = [ 'a boy ran' , 'green apples are worse' , 'a boy ran towards the mill' ,  ' this is another sentence ' , 'a boy ran towards the mill and fell']
list2 = []
for i in list1:
    bool = True
    for j in list1:
        if id(i) != id(j) and j.startswith(i): bool = False
    if bool: list2.append(i)
>>> list2
['green apples are worse', ' this is another sentence ', 'a boy ran towards the mill and fell']

Upvotes: 3

Remove string element in a list of strings if the first characters match with another string element in the list

Answers (3)

Related Questions