Reputation: 10236
given a list of strings how can i find overlapping sequences
arr=['iloveapple','banana','ilove','ban']
substring_list=[]
for idx,s in enumerate(arr):
if idx==0:
substring_list.append(s)
else:
if any(s in x for x in substring_list):
continue
else:
substring_list.append(s)
print(substring_list)
This method is very slow when the list grows large (>1000) . Is there a better way to handle this. Or is there a better data structure to store this overlapping sequences
Upvotes: 0
Views: 744
Reputation: 432
If I understand your problem correctly, I think you need suffix trees. They are the most efficient string matching data-structures. You can build on this to figure out which string in the list overlaps with others.
from suffix_trees import STree
arr=['iloveapple','banana','ilove','ban']
suffix_arr = "".join(arr)
st = STree.STree(suffix_arr)
for i, s in enumerate(arr):
overlapped_index = st.find_all(s)
for index in overlapped_index:
print(arr[i], " overlaps ", index, suffix_arr[index:] )
Upvotes: 0
Reputation: 42143
You could try to let string functions work for you:
arr=['iloveapple','banana','ilove','ban']
allStrings = " ".join(arr)
substring_list = [ s for s in arr if len(allStrings.split(s))>2 ]
Joining all the strings together (with a separator) will give you a single pool of text to search into. Knowing that each string is present at least one, splitting the combined strings on any of the keyword should give only two parts if there is only one instance. But if there is an overlap, then the smaller keyword will appear more than once and cause additional splits.
Upvotes: 0
Reputation: 27495
You can do this with a list comprehension and skip the whole list building part.
#make sure to sort the list first
arr = sorted(arr, key = len)
print([s for i, s in enumerate(arr) if all(s not in _ for _ in arr[i + 1:])])
#['iloveapple', 'ilovebanana']
Upvotes: 1
Reputation: 106598
You can add all possible substrings of a given string in arr
to a seen
set, so that you can look up whether or not a new word in arr
is a substring of any previous strings with O(1) time complexity:
seen = set()
substring_list = []
for s in arr:
if s not in seen:
substring_list.append(s)
seen.update({s[i:i + n + 1] for n in range(len(s)) for i in range(len(s) - n)})
substring_list
becomes:
['iloveapple', 'ilovebanana']
Upvotes: 2