Reputation: 57
I have a list of substrings which has about 10000 entries -
substr_ls = ['N_COULT16_1 1', 'S_COULT2', 'XBG_F 1', 'FAIRWY_3', .....]
I have a list of strings which has about 100 entries -
main_str_ls = ['N_COULT16_1 1XF', 'S_COULT2_RT', 'XBG_F TX300 1', 'FAIRWY_34_AG', ....]
As you can see, the entries in substr_ls are not exact substrings of the strings in main_str_ls. The sequence of letters, digits, etc. in the substring has to appear in the same order in the string for it to be a match. For example, 'XBG_F 1' is a match for 'XBG_F TX300 1' because the sequence matches, even though there is a 'TX300' between 'XBG_F' and '1'.
What I'm currently doing is using this function -
import itertools

def is_subsequence(pattern, items_to_use):
    # Wrap in a generator so each search resumes where the previous one stopped
    items_to_use = (x for x in items_to_use)
    return all(any(x == y for y in items_to_use) for x, _ in itertools.groupby(pattern))
This is from the question "Finding a substring in a jumbled string". I iterate over main_str_ls (its entries are used as items_to_use) and substr_ls (its entries are used as pattern), and when I find a match, I break out of the loop and do some stuff. Something like this -
for main_str in main_str_ls:
    main_str = main_str.strip()
    for substr in substr_ls:
        substr = substr.strip()
        if is_subsequence(substr, main_str):
            ...  # do stuff
            break  # stop at the first match for this string
Is there a better way or a pythonic approach for doing this?
Upvotes: 1
Views: 453
Reputation: 1598
One difference between what you need and the jumbled-string question is that the latter is concerned with allowing repeats, so I don't think you can use that design directly. Instead, try the approach described here: https://www.geeksforgeeks.org/given-two-strings-find-first-string-subsequence-second/
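For reference, here is a minimal sketch of a plain subsequence check in Python, assuming every character of the pattern (including repeated ones) must be found in order. The helper name find_matches and the dict it returns are made up for illustration and are not from the question.

```python
def is_subsequence(pattern, text):
    """Return True if the characters of pattern appear in text in order."""
    it = iter(text)
    # `ch in it` advances the iterator until it finds ch, so each pattern
    # character has to be found after the previous match.
    return all(ch in it for ch in pattern)


def find_matches(main_strs, substrs):
    """Hypothetical helper: map each main string to its first matching pattern."""
    matches = {}
    for main_str in main_strs:
        main_str = main_str.strip()
        for substr in substrs:
            substr = substr.strip()
            if is_subsequence(substr, main_str):
                matches[main_str] = substr
                break  # stop at the first match for this string
    return matches
```

Unlike the groupby version, this does not collapse repeated characters in the pattern, so 'aab' is not treated as a subsequence of 'ab'.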
Upvotes: 1