Rohit
Rohit

Reputation: 10236

overlapping sequences in a list

given a list of strings how can i find overlapping sequences

arr=['iloveapple','banana','ilove','ban']
substring_list=[]
for idx,s in enumerate(arr):
    if idx==0:
        substring_list.append(s)
    else:
        if any(s in x for x in substring_list):
            continue
        else:
            substring_list.append(s)


print(substring_list)

This method is very slow when the list grows large (>1000) . Is there a better way to handle this. Or is there a better data structure to store this overlapping sequences

Upvotes: 0

Views: 744

Answers (4)

Pari Rajaram
Pari Rajaram

Reputation: 432

If I understand your problem correctly, I think you need suffix trees. They are the most efficient string matching data-structures. You can build on this to figure out which string in the list overlaps with others.

from suffix_trees import STree

arr=['iloveapple','banana','ilove','ban']

suffix_arr = "".join(arr)

st = STree.STree(suffix_arr)

for i, s in enumerate(arr):
    overlapped_index = st.find_all(s)
    for index in overlapped_index:
        print(arr[i], " overlaps ", index,  suffix_arr[index:] )

Upvotes: 0

Alain T.
Alain T.

Reputation: 42143

You could try to let string functions work for you:

  arr=['iloveapple','banana','ilove','ban']
  allStrings = " ".join(arr)
  substring_list = [ s for s in arr if len(allStrings.split(s))>2 ]

Joining all the strings together (with a separator) will give you a single pool of text to search into. Knowing that each string is present at least one, splitting the combined strings on any of the keyword should give only two parts if there is only one instance. But if there is an overlap, then the smaller keyword will appear more than once and cause additional splits.

Upvotes: 0

Jab
Jab

Reputation: 27495

You can do this with a list comprehension and skip the whole list building part.

#make sure to sort the list first
arr = sorted(arr, key = len)
print([s for i, s in enumerate(arr) if all(s not in _ for _ in arr[i + 1:])])
#['iloveapple', 'ilovebanana']

Upvotes: 1

blhsing
blhsing

Reputation: 106598

You can add all possible substrings of a given string in arr to a seen set, so that you can look up whether or not a new word in arr is a substring of any previous strings with O(1) time complexity:

seen = set()
substring_list = []
for s in arr:
    if s not in seen:
        substring_list.append(s)
    seen.update({s[i:i + n + 1] for n in range(len(s)) for i in range(len(s) - n)})

substring_list becomes:

['iloveapple', 'ilovebanana']

Upvotes: 2

Related Questions