Pro Q
Pro Q

Reputation: 5048

How to find matches around fixed strings

I'm looking for help finding Python functions that allow me to take a list of strings, such as ["I like ", " and ", " because "] and a single target string, such as "I like lettuce and carrots and onions because I do", and finds all the ways the characters in the target string can be grouped such that each of the strings in the list comes in order.

For example:

solution(["I like ", " and ", " because ", "do"],
         "I like lettuce and carrots and onions because I do")

should return:

[("I like ", "lettuce", " and ", "carrots and onions", " because ", "I ", "do"), 
 ("I like ", "lettuce and carrots", " and ", "onions", " because ", "I ", "do")]

Notice that in each of the tuples, the strings in the list parameter are there in order, and the function returns each of the possible ways to split up the target string in order to achieve this.

Another example, this time with only one possible way of organizing the characters:

solution(["take ", " to the park"], "take Alice to the park")

should give the result:

[("take ", "Alice", " to the park")]

Here's an example where there is no way to organize the characters correctly:

solution(["I like ", " because ", ""],
         "I don't like cheese because I'm lactose-intolerant")

should give back:

[]

because there is no way to do it. Notice that the "I like " in the first parameter cannot be split up. The target string doesn't have the string "I like " in it, so there's no way it could match.

Here's a final example, again with multiple options:

solution(["I", "want", "or", "done"],
         "I want my sandwich or I want my pizza or salad done")

should return

[("I", " ", "want", " my sandwich ", "or", " I want my pizza or salad ", "done"),
 ("I", " ", "want", " my sandwich or I want my pizza ", "or", " salad ", "done"),
 ("I", " want my sandwich or I", "want", " my pizza ", "or", " salad ", "done")]`

Notice that, again, each of the strings ["I", "want", "or", "done"] is included in each of the tuples, in order, and that the rest of the characters are reordered around those strings in any way possible. The list of all the possible reorderings is what is returned.

Note that it's also assumed that the first string in the list will appear at the start of the target string, and the last string in the list will appear at the end of the target string. (If they don't, the function should return an empty list.)

What Python functions will allow me to do this?

I've tried using regex functions, but it seems to fail in the cases where there's more than one option.

Upvotes: 1

Views: 386

Answers (2)

Pro Q
Pro Q

Reputation: 5048

EDIT: I've since learned some programming tactics and redone my answer to this problem.

To answer my question, you don't need any special functions. If you'd like a version that's relatively easy to code, look below for a different answer. This solution is also less documented compared to the solution below, but it uses dynamic programming and memoization, so it should be faster than the previous solution, and less memory intensive. It also deals with regex characters (such as |) correctly. (The "previous answer" solution below does not.)

def solution(fixed_strings, target_string):
        def get_middle_matches(s, fixed_strings):
            '''
            Gets the fixed strings matches without the first and last first strings
            Example the parameter tuple ("ABCBD", ["B"]) should give back [["A", "B", "CBD"], ["ABC", "B", "D"]]
            '''
    
            # in the form {(s, s_index, fixed_string_index, fixed_character_index): return value of recursive_get_middle_matches called with those parameters}
            lookup = {}
            
            def memoized_get_middle_matches(*args):
                '''memoize the recursive function'''
                try:
                    ans = lookup[args]
                    return ans
                except KeyError:
                    ans = recursive_get_middle_matches(*args)
                    lookup[args] = ans
                    return ans
    
            def recursive_get_middle_matches(s, s_index, fixed_string_index, fixed_character_index):
                '''
                Takes a string, an index into that string, a index into the list of middle fixed strings,
                ...and an index into that middle fixed string.
                
                Returns what fixed_string_matches(s, fixed_strings[fixed_string_index:-1]) would return, and deals with edge cases.
                '''
                
                # base case: there's no fixed strings left to match
                try:
                    fixed_string = fixed_strings[fixed_string_index]
                except IndexError:
                    # we just finished matching the last fixed string, but there's some stuff left over
                    return [[s]]
                    
                # recursive case: we've finished matching a fixed string
                # note that this needs to go before the end of the string base case
                # ...because otherwise the matched fixed string may not be added to the answer,
                # ...since getting to the end of the main string will short-circuit it
                try:
                    fixed_character = fixed_string[fixed_character_index]
                except IndexError:
                    # finished matching this fixed string
                    upper_slice = s_index
                    lower_slice = upper_slice - len(fixed_string)
                    prefix = s[:lower_slice]
                    match = s[lower_slice:upper_slice]
                    postfix = s[upper_slice:]
                    match_ans = [prefix, match]
                    recursive_answers = memoized_get_middle_matches(postfix, 0, fixed_string_index + 1, 0)
                    if fixed_string == '' and s_index < len(s):
                        recursive_skip_answers = memoized_get_middle_matches(s, s_index + 1, fixed_string_index, fixed_character_index)
                        return [match_ans + recursive_ans for recursive_ans in recursive_answers] + recursive_skip_answers
                    else:
                        return [match_ans + recursive_ans for recursive_ans in recursive_answers]
                    
    
                # base cases: we've reached the end of the string
                try:
                    character = s[s_index]
                except IndexError:
                    # nothing left to match in the main string
                    if fixed_string_index >= len(fixed_strings):
                        # it completed matching everything it needed to
                        return [[""]]
                    else:
                        # it didn't finish matching everything it needed to
                        return []
    
                # recursive cases: either we match this character or we don't
                character_matched = (character == fixed_character)
                starts_fixed_string = (fixed_character_index == 0)
                if starts_fixed_string:
                    # if this character starts the fixed string, we're still searching for this same fixed string
                    recursive_skip_answers = memoized_get_middle_matches(s, s_index + 1, fixed_string_index, fixed_character_index)
    
                if character_matched:
                    recursive_take_answers = memoized_get_middle_matches(s, s_index + 1, fixed_string_index, fixed_character_index + 1)
                    if starts_fixed_string:
                        # we have the option to either take the character as a match, or skip over it
                        return recursive_skip_answers + recursive_take_answers
                    else:
                        # this character is past the start of the fixed string; we can no longer match this fixed string
                        # since we can't match one of the fixed strings, this is a failed path if we don't match this character
                        # thus, we're forced to take this character as a match
                        return recursive_take_answers
                else:
                    if starts_fixed_string:
                        # we can't match it here, so we skip over and continue
                        return recursive_skip_answers
                    else:
                        # this character is past the start of the fixed string; we can no longer match this fixed string
                        # since we can't match one of the fixed strings, there are no possible matches here
                        return []
            
            ## main code
            return memoized_get_middle_matches(s, 0, 0, 0)
            
        ## main code
            
        # doing the one fixed string case first because it happens a lot
        if len(fixed_strings) == 1:
            # if it matches, then there's just that one match, otherwise, there's none.
            if target_string == fixed_strings[0]:
                return [target_string]
            else:
                return []
                
        if len(fixed_strings) == 0:
            # there's no matches because there are no fixed strings
            return []
            
        # separate the first and last from the middle
        first_fixed_string = fixed_strings[0]
        middle_fixed_strings = fixed_strings[1:-1]
        last_fixed_string = fixed_strings[-1]
        prefix = target_string[:len(first_fixed_string)]
        middle = target_string[len(first_fixed_string):len(target_string)-len(last_fixed_string)]
        postfix = target_string[len(target_string)-len(last_fixed_string):]
    
        # make sure the first and last fixed strings match the target string
        # if not, the target string does not match
        if not (prefix == first_fixed_string and postfix == last_fixed_string):
            return []
        else:
            # now, do the check for the middle fixed strings
            return [[prefix] + middle + [postfix] for middle in get_middle_matches(middle, middle_fixed_strings)]
    
    print(solution(["I like ", " and ", " because ", "do"],
                   "I like lettuce and carrots and onions because I do"))
    print([("I like ", "lettuce", " and ", "carrots and onions", " because ", "I ", "do"),
           ("I like ", "lettuce and carrots", " and ", "onions", " because ", "I ", "do")])
    print()
    
    print(solution(["take ", " to the park"], "take Alice to the park"))
    print([("take ", "Alice", " to the park")])
    print()
    
    # Courtesy of @ktzr
    print(solution(["I like ", " because "],
                   "I don't like cheese because I'm lactose-intolerant"))
    print([])
    print()
    
    print(solution(["I", "want", "or", "done"],
             "I want my sandwich or I want my pizza or salad done"))
    print([("I", " ", "want", " my sandwich ", "or", " I want my pizza or salad ", "done"),
     ("I", " ", "want", " my sandwich or I want my pizza ", "or", " salad ", "done"),
     ("I", " want my sandwich or I", "want", " my pizza ", "or", " salad ", "done")])

Previous answer:

To answer my question, the itertools.product function and regex.finditer with the overlapped parameter were the two key functions to this solution. I figured I'd include my final code in case it helps someone else in a similar situation.

I really care about my code being super readable, so I ended up coding my own solution based on @ktzr's solution. (Thank you!)

My solution uses a couple of weird things.

First, it uses a overlapped parameter, which is only available through the regex module, and must be installed (most likely via pip install regex). Then, include it at the top with import regex as re. This makes it easy to search for overlapped matches in a string.

Second, my solution makes use of an itertools function that isn't explicitly included in the library, which you have to define as such:

import itertools
def itertools_pairwise(iterable):
    '''s -> (s0,s1), (s1,s2), (s2, s3), ...'''
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

This function simply lets me iterate pairwise through a list, making sure each element (except for the first and last) in the list is encountered twice.

With those two things in place, here's my solution:

def solution(fixed_strings, target_string):
    # doing the one fixed string case first because it happens a lot
    if len(fixed_strings) == 1:
        # if it matches, then there's just that one match, otherwise, there's none.
        if target_string == fixed_strings[0]:
            return [target_string]
        else:
            return []

    # make sure the first and last fixed strings match the target string
    # if not, the target string does not match
    if not (target_string.startswith(fixed_strings[0]) and target_string.endswith(fixed_strings[-1])):
        return []

    # get the fixed strings in the middle that it now needs to search for in the middle of the target string
    middle_fixed_strings = fixed_strings[1:-1]

    # where in the target string it found the middle fixed strings.
    # middle_fixed_strings_placements is in the form: [[where it found the 1st middle fixed string], ...]
    # [where it found the xth middle fixed string] is in the form: [(start index, end index), ...]
    middle_fixed_strings_placements = [[match.span() for match in re.finditer(string, target_string, overlapped=True)]
                                       for string in middle_fixed_strings]

    # if any of the fixed strings couldn't be found in the target string, there's no matches
    if [] in middle_fixed_strings_placements:
        return []

    # get all of the possible ways each of the middle strings could be found once within the target string
    all_placements = itertools.product(*middle_fixed_strings_placements)

    # remove the cases where the middle strings overlap or are out of order
    good_placements = [placement for placement in all_placements
                       if not (True in [placement[index][1] > placement[index + 1][0]
                                        for index in range(len(placement) - 1)])]

    # create a list of all the possible final matches
    matches = []
    target_string_len = len(target_string) # cache for later
    # save the start and end spans which are predetermined by their length and placement
    start_span = (0, len(fixed_strings[0]))
    end_span = (target_string_len - len(fixed_strings[-1]), target_string_len)
    for placement in good_placements:
        placement = list(placement)
        # add in the spans for the first and last fixed strings
        # this makes it so each placement is in the form: [1st fixed string span, ..., last fixed string span]
        placement.insert(0, start_span)
        placement.append(end_span)

        # flatten the placements list to get the places where we need to cut up the string.
        # we want to cut the string at the span values to get out the fixed strings
        cuts = [cut for span in placement for cut in span]

        match = []
        # go through the cuts and make them to create the list
        for start_cut, end_cut in itertools_pairwise(cuts):
            match.append(target_string[start_cut:end_cut])
        matches.append(match)

    return matches

Upvotes: 0

ktzr
ktzr

Reputation: 1645

I have a solution, it needs a fair bit of refactoring but it seems to work, I hope this helps, it was quite an interesting problem.

import itertools
import re
from collections import deque


def solution(search_words, search_string):
    found = deque()
    for search_word in search_words:
        found.append([(m.start()) for m in re.compile(search_word).finditer(search_string)])
    if len(found) != len(search_words) or len(found) == 0:
        return []  # no search words or not all words found
    word_positions_lst = [list(i) for i in itertools.product(*found) if sorted(list(i)) == list(i)]

    ret_lst = []
    for word_positions in word_positions_lst:
        split_positions = list(itertools.chain.from_iterable(
            (split_position, split_position + len(search_word))
            for split_position, search_word in zip(word_positions, search_words)))
        last_seach_word = search_string[split_positions[-1]:]
        ret_strs = [search_string[a:b] for a, b in zip(split_positions, split_positions[1:])]
        if last_seach_word:
            ret_strs.append(last_seach_word)
        if len(search_string) == sum(map(len,ret_strs)):
            ret_lst.append(tuple(ret_strs))
    return ret_lst


print(solution(["I like ", " and ", " because ", "do"],
               "I like lettuce and carrots and onions because I do"))
print([("I like ", "lettuce", " and ", "carrots and onions", " because ", "I ", "do"),
       ("I like ", "lettuce and carrots", " and ", "onions", " because ", "I ", "do")])
print()

print(solution(["take ", " to the park"], "take Alice to the park"))
print([("take ", "Alice", " to the park")])
print()

print(solution(["I like ", " because "],
               "I don't like cheese because I'm lactose-intolerant"))
print([])
print()

Outputs:

[('I like ', 'lettuce', ' and ', 'carrots and onions', ' because ', 'I ', 'do'), ('I like ', 'lettuce and carrots', ' and ', 'onions', ' because ', 'I ', 'do')]
[('I like ', 'lettuce', ' and ', 'carrots and onions', ' because ', 'I ', 'do'), ('I like ', 'lettuce and carrots', ' and ', 'onions', ' because ', 'I ', 'do')]

[('take ', 'Alice', ' to the park')]
[('take ', 'Alice', ' to the park')]

[]
[]

[('I', ' ', 'want', ' my sandwich ', 'or', ' I want my pizza or salad ', 'done'), ('I', ' ', 'want', ' my sandwich or I want my pizza ', 'or', ' salad ', 'done'), ('I', ' want my sandwich or I ', 'want', ' my pizza ', 'or', ' salad ', 'done')]
[('I', ' ', 'want', ' my sandwich ', 'or', ' I want my pizza or salad ', 'done'), ('I', ' ', 'want', ' my sandwich or I want my pizza ', 'or', ' salad ', 'done'), ('I', ' want my sandwich or I', 'want', ' my pizza ', 'or', ' salad ', 'done')]

Edit: refactored code to have meaningful variable names.

Edit2: added the last case i forgot about.

Upvotes: 1

Related Questions