alvas

Reputation: 122240

How to align the offsets of two strings given a list of substring pairs?

Given two strings a and b, and a list c of substring pairs (the first element from a, the second from b):

a = "how are you ?"
b = "wie gehst's es dir?"

c = [
 ("how", "wie"), 
 ("are", "gehst's"),
 ("you", "es")
]

What's the optimal method to get the offsets that produce:

offsets = [
 ("how", "wie", (0, 3), (0, 3)), 
 ("are", "gehst's", (4, 6), (4, 11)),
 ("you", "es", (7, 9), (12, 14))
]

ChatGPT suggests this simplistic approach:

To generate the desired offsets from the given strings a and b and the list of substring pairs c, we need to find the starting and ending positions (indices) of each substring from a in a itself, and each substring from b in b itself.

Steps:

a = "how are you ?"
b = "wie gehst's es dir?"
c = [
    ("how", "wie"),
    ("are", "gehst's"),
    ("you", "es")
]

# Create the offsets list
offsets = []
for substring_a, substring_b in c:
    # Find the start and end indices for substring_a in string a
    start_a = a.find(substring_a)
    end_a = start_a + len(substring_a) - 1
    
    # Find the start and end indices for substring_b in string b
    start_b = b.find(substring_b)
    end_b = start_b + len(substring_b) - 1
    
    # Append the result as a tuple
    offsets.append((substring_a, substring_b, (start_a, end_a), (start_b, end_b)))

# Output the result
print(offsets)

But is there something more optimal, especially if the terms are repeated? E.g.

a = "how are you ? are you okay ?"
b = "wie gehst's es dir?  geht es dir gut "

c = [
 ("how", "wie"), 
 ("are", "gehst's"),
 ("you", "es"),
 ("are", "geht"), 
 ("you", "es"),
 ("okay", "gut")
]
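
To make the problem concrete, here is a minimal sketch of what goes wrong with repeated terms. The names `terms_a` and `starts` are only for illustration: calling `find` without a start position reports the first occurrence every time, so the second "are" and "you" get the offsets of the first ones.

```python
a = "how are you ? are you okay ?"
# The a-side terms from c, in order (illustrative name).
terms_a = ["how", "are", "you", "are", "you", "okay"]

# str.find with no start argument always returns the first occurrence.
starts = [a.find(term) for term in terms_a]
print(starts)  # [0, 4, 8, 4, 8, 22]
```

The repeated 4 and 8 are the problem; the second "are" and "you" actually start at 14 and 18.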

Upvotes: 2

Views: 213

Answers (4)

cdlane

Reputation: 41905

This approach is similar to your original, and to some of the other solutions provided, but it is defensive: a missing term won't affect further searches. Note that the end offsets here are exclusive.

import pprint

a = "how are you ? are you okay ?"
b = "wie gehst's es dir?  geht es dir gut "

c = [
    ("how", "wie"),
    ("are", "gehst's"),
    ("you", "es"),
    ("are", "geht"),
    ("you", "es"),
    ("okay", "gut")
]

# Create the offsets list
offsets = []

start_a = start_b = 0

for substring_a, substring_b in c:
    # Find the start and end indices for substring_a in string a
    if (hit_a := a.find(substring_a, start_a)) != -1:
        start_a = hit_a
        end_a = start_a + len(substring_a)

        # Find the start and end indices for substring_b in string b
        if (hit_b := b.find(substring_b, start_b)) != -1:
            start_b = hit_b
            end_b = start_b + len(substring_b)

            # Append the result as a tuple
            offsets.append((substring_a, substring_b, (start_a, end_a), (start_b, end_b)))

            start_b = end_b

        start_a = end_a

# Output the result
pprint.pprint(offsets)

OUTPUT

% python3 test.py
[('how', 'wie', (0, 3), (0, 3)),
 ('are', "gehst's", (4, 7), (4, 11)),
 ('you', 'es', (8, 11), (12, 14)),
 ('are', 'geht', (14, 17), (21, 25)),
 ('you', 'es', (18, 21), (26, 28)),
 ('okay', 'gut', (22, 26), (33, 36))]
%
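
To see the defensive behaviour in action, here is a rough sketch of the same idea wrapped in a function and fed a pair that does not occur. It is a variant, not the exact code above: the function name `align_offsets` and the pair ("is", "ist") are made up for illustration, and it assumes that when either side is missing the whole pair should be skipped with neither search position advanced.

```python
def align_offsets(a, b, pairs):
    """Return (sub_a, sub_b, span_a, span_b) tuples; spans are (start, exclusive end)."""
    offsets = []
    pos_a = pos_b = 0  # where the next search starts in each string
    for sub_a, sub_b in pairs:
        hit_a = a.find(sub_a, pos_a)
        hit_b = b.find(sub_b, pos_b)
        if hit_a == -1 or hit_b == -1:
            # Assumption: skip the whole pair and leave both positions untouched.
            continue
        pos_a = hit_a + len(sub_a)
        pos_b = hit_b + len(sub_b)
        offsets.append((sub_a, sub_b, (hit_a, pos_a), (hit_b, pos_b)))
    return offsets


a = "how are you ?"
b = "wie gehst's es dir?"
# ("is", "ist") is a made-up pair that appears in neither string.
pairs = [("how", "wie"), ("is", "ist"), ("are", "gehst's"), ("you", "es")]
result = align_offsets(a, b, pairs)
print(result)
```

The missing pair is simply dropped, and the pairs after it still get the offsets they would have had without it.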

Upvotes: 1

Amirhossein Misaghi

Reputation: 1

First of all, there is a problem with your output:

offsets = [
 ("how", "wie", (0, 3), (0, 3)), 
 ("are", "gehst's", (4, 6), (4, 11)),
 ("you", "es", (7, 9), (12, 14))
]

Should be:

offsets = [
 ("how", "wie", (0, 3), (0, 3)), 
 ("are", "gehst's", (4, 7), (4, 11)),
 ("you", "es", (8, 11), (12, 14))
]

You can use this code:

a = "how are you ?"
a_list = a.split()
b = "wie gehst's es dir?"
b_list = b.split()

# This way you can get the word easily by using its position in the sentence
a_values = dict(zip(range(len(a_list)), a_list))  # {index:word}
b_values = dict(zip(range(len(b_list)), b_list))
# Helps you find the position in the sentence by word.
a_keys = dict(zip(a_list, range(len(a_list))))  # {word:index}
b_keys = dict(zip(b_list, range(len(b_list))))


target = 'b' # This corresponds to the sentence. As in the example it is either sentence a or b
target_values = b_values if target == 'b' else a_values
target_keys = b_keys if target == 'b' else a_keys

# Give an arbitrary word to a_keys/b_keys to look up its position
for value in target_values.values():
    index = target_keys[value]  # Find the index of the word

    start = 0 # Calculate its offset
    for i in range(index):
        start += len(target_values[i])

        start += 1  # Whitespace between words

    end = start + len(target_values[index])

    print(f'{value}: ({start}, {end})')
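
The code above relies on words being unique and separated by exactly one space. If that does not hold (as in the longer example from the question, which has a double space and repeated words), one possible alternative is to let `re.finditer` report the spans directly. This is only a sketch of a different technique, and `word_spans` is a name I made up:

```python
import re


def word_spans(sentence):
    """Yield (word, (start, end)) for each whitespace-delimited word; end is exclusive."""
    for match in re.finditer(r"\S+", sentence):
        yield match.group(), match.span()


b = "wie gehst's es dir?  geht es dir gut "
spans = list(word_spans(b))
for word, span in spans:
    print(f"{word}: {span}")
```

Each occurrence gets its own span, so the two "es" tokens are reported separately instead of collapsing into one dictionary key.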

Upvotes: 0

blhsing

Reputation: 107015

As an alternative, you can enclose the substrings in capture groups of a regex pattern and then use the start and end methods of the Match object to get the offset of each substring in a given text. This is especially useful if you'd like to search for patterns rather than strictly literal substrings. The code below assumes a, b and c are defined as in the question, and the end offsets are inclusive:

import re
from itertools import chain

offsets = []
for text, substrings in zip((a, b), zip(*c)):
    match = re.search('.*?'.join(map('({})'.format, substrings)), text)
    offsets.append([
        (substring, (match.start(group), match.end(group) - 1))
        for group, substring in enumerate(match.groups(), 1)
    ])
offsets = [tuple(chain.from_iterable(zip(*info))) for info in zip(*offsets)]

offsets becomes:

[('how', 'wie', (0, 2), (0, 2)),
 ('are', "gehst's", (4, 6), (4, 10)),
 ('you', 'es', (8, 10), (12, 13)),
 ('are', 'geht', (14, 16), (21, 24)),
 ('you', 'es', (18, 20), (26, 27)),
 ('okay', 'gut', (22, 25), (33, 35))]
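
As a rough illustration of the "pattern rather than substring" case, the sketch below mixes real patterns with a literal that contains a regex metacharacter. The list `patterns` is a hypothetical input, and `re.escape` is used for the literal "dir?" so that the "?" is not read as a quantifier. Spans here come from `Match.span`, so the end is exclusive.

```python
import re

b = "wie gehst's es dir?  geht es dir gut "

# Hypothetical inputs: two patterns for words starting with "geh",
# with an escaped literal in between.
patterns = [r"geh\S*", re.escape("dir?"), r"geh\S*"]

# One capture group per pattern, joined by a lazy "anything in between".
regex = ".*?".join(f"({p})" for p in patterns)
match = re.search(regex, b)
if match is None:
    raise ValueError("the patterns do not occur in this order")

spans = [(match.group(i), match.span(i)) for i in range(1, len(patterns) + 1)]
print(spans)
```

The two `geh\S*` groups pick up "gehst's" and "geht" respectively, which a plain substring search could not express with a single term.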


Upvotes: 1

Barmar

Reputation: 782158

str.find() takes optional start and end arguments to restrict where it searches for the substring. So you can use the previous end_a as the start argument to the next a.find().

offsets = []
end_a = end_b = 0

for substring_a, substring_b in c:
    # Find the start and end indices for substring_a in string a
    start_a = a.find(substring_a, end_a)
    end_a = start_a + len(substring_a)
    # Find the start and end indices for substring_b in string b
    start_b = b.find(substring_b, end_b)
    end_b = start_b + len(substring_b)
    # Append the result as a tuple
    offsets.append((substring_a, substring_b, (start_a, end_a - 1), (start_b, end_b - 1)))

Results:

[('how', 'wie', (0, 2), (0, 2)),
 ('are', "gehst's", (4, 6), (4, 10)),
 ('you', 'es', (8, 10), (12, 13)),
 ('are', 'geht', (14, 16), (21, 24)),
 ('you', 'es', (18, 20), (26, 27)),
 ('okay', 'gut', (22, 25), (33, 35))]
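
Note that the end offsets above are inclusive (the index of the last character), whereas Python slicing expects an exclusive end. A small sketch of the difference, with variable names chosen just for this example:

```python
a = "how are you ? are you okay ?"

start = a.find("are")
end_exclusive = start + len("are")  # one past the last character
end_inclusive = end_exclusive - 1   # index of the last character

# An exclusive end can be used directly as a slice bound.
assert a[start:end_exclusive] == "are"
# An inclusive end needs + 1 before slicing, otherwise the last character is lost.
assert a[start:end_inclusive + 1] == "are"
assert a[start:end_inclusive] == "ar"
```

Whichever convention you pick, it is probably worth using the same one for both strings so the spans can be compared or sliced consistently.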

Upvotes: 1
