Reputation: 122240
Given a and b relating to a list of substrings in c:
a = "how are you ?"
b = "wie gehst's es dir?"
c = [
    ("how", "wie"),
    ("are", "gehst's"),
    ("you", "es")
]
What's the optimal method to get the offsets that produce:
offsets = [
    ("how", "wie", (0, 3), (0, 3)),
    ("are", "gehst's", (4, 6), (4, 11)),
    ("you", "es", (7, 9), (12, 14))
]
ChatGPT suggests this simple approach:
To generate the desired offsets from the given strings a and b and the list of substring pairs c, we need to find the starting and ending positions (indices) of each substring from a in a itself, and each substring from b in b itself.
Steps:
a = "how are you ?"
b = "wie gehst's es dir?"
c = [
    ("how", "wie"),
    ("are", "gehst's"),
    ("you", "es")
]

# Create the offsets list
offsets = []
for substring_a, substring_b in c:
    # Find the start and end indices for substring_a in string a
    start_a = a.find(substring_a)
    end_a = start_a + len(substring_a) - 1
    # Find the start and end indices for substring_b in string b
    start_b = b.find(substring_b)
    end_b = start_b + len(substring_b) - 1
    # Append the result as a tuple
    offsets.append((substring_a, substring_b, (start_a, end_a), (start_b, end_b)))

# Output the result
print(offsets)
But is there something more optimal, especially if the terms are repeated? E.g.
a = "how are you ? are you okay ?"
b = "wie gehst's es dir? geht es dir gut "
c = [
    ("how", "wie"),
    ("are", "gehst's"),
    ("you", "es"),
    ("are", "geht"),
    ("you", "es"),
    ("okay", "gut")
]
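To make the problem concrete, here is a minimal sketch (only the a string from above, plus a hypothetical terms list holding the first element of each pair) of what plain find() does with the repeated terms:

```python
a = "how are you ? are you okay ?"
terms = ["how", "are", "you", "are", "you", "okay"]

# Plain find() always searches from index 0, so every repeat
# collapses onto the first occurrence of that term.
starts = [a.find(term) for term in terms]
print(starts)  # [0, 4, 8, 4, 8, 22]
```

The second "are" and "you" come back as 4 and 8 again rather than 14 and 18.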
Upvotes: 2
Views: 213
Reputation: 41905
This approach is similar to your original and to some of the other solutions provided, but it is defensive: a missing term won't affect later searches:
import pprint

a = "how are you ? are you okay ?"
b = "wie gehst's es dir? geht es dir gut "
c = [
    ("how", "wie"),
    ("are", "gehst's"),
    ("you", "es"),
    ("are", "geht"),
    ("you", "es"),
    ("okay", "gut")
]

# Create the offsets list
offsets = []
start_a = start_b = 0
for substring_a, substring_b in c:
    # Find the start and end indices for substring_a in string a
    if (hit_a := a.find(substring_a, start_a)) != -1:
        start_a = hit_a
        end_a = start_a + len(substring_a)
        # Find the start and end indices for substring_b in string b
        if (hit_b := b.find(substring_b, start_b)) != -1:
            start_b = hit_b
            end_b = start_b + len(substring_b)
            # Append the result as a tuple
            offsets.append((substring_a, substring_b, (start_a, end_a), (start_b, end_b)))
            start_b = end_b
        start_a = end_a

# Output the result
pprint.pprint(offsets)
OUTPUT
% python3 test.py
[('how', 'wie', (0, 3), (0, 3)),
('are', "gehst's", (4, 7), (4, 11)),
('you', 'es', (8, 11), (12, 14)),
('are', 'geht', (14, 17), (21, 25)),
('you', 'es', (18, 21), (26, 28)),
('okay', 'gut', (22, 26), (33, 36))]
%
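To see the skip-on-miss idea in isolation, here is a rough sketch wrapped in a function. It is a variation rather than the exact code above: find_offsets and the extra ("today", "heute") pair are made up for illustration, and this version leaves both search positions untouched whenever either term is missing.

```python
def find_offsets(a, b, pairs):
    """Hypothetical helper: end-exclusive offsets, skipping pairs that are not found."""
    offsets = []
    pos_a = pos_b = 0  # where the next search starts in each string
    for sub_a, sub_b in pairs:
        hit_a = a.find(sub_a, pos_a)
        hit_b = b.find(sub_b, pos_b)
        if hit_a == -1 or hit_b == -1:
            continue  # missing term: leave both cursors untouched
        end_a = hit_a + len(sub_a)
        end_b = hit_b + len(sub_b)
        offsets.append((sub_a, sub_b, (hit_a, end_a), (hit_b, end_b)))
        pos_a, pos_b = end_a, end_b
    return offsets


a = "how are you ?"
b = "wie gehst's es dir?"
pairs = [("how", "wie"), ("today", "heute"), ("are", "gehst's"), ("you", "es")]

for row in find_offsets(a, b, pairs):
    print(row)
# ('how', 'wie', (0, 3), (0, 3))
# ('are', "gehst's", (4, 7), (4, 11))
# ('you', 'es', (8, 11), (12, 14))
```

The pair that does not occur is dropped, and the pairs after it are still found at their normal positions.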
Upvotes: 1
Reputation: 1
First of all, there is a problem with your output:
offsets = [
    ("how", "wie", (0, 3), (0, 3)),
    ("are", "gehst's", (4, 6), (4, 11)),
    ("you", "es", (7, 9), (12, 14))
]
Should be:
offsets = [
    ("how", "wie", (0, 3), (0, 3)),
    ("are", "gehst's", (4, 7), (4, 11)),
    ("you", "es", (8, 11), (12, 14))
]
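A quick way to check this: with end-exclusive offsets, slicing the original string gives each substring back. A minimal sketch using only the strings from the question:

```python
a = "how are you ?"
b = "wie gehst's es dir?"

offsets = [
    ("how", "wie", (0, 3), (0, 3)),
    ("are", "gehst's", (4, 7), (4, 11)),
    ("you", "es", (8, 11), (12, 14)),
]

# With end-exclusive offsets, text[start:end] reproduces each substring.
for sub_a, sub_b, (start_a, end_a), (start_b, end_b) in offsets:
    assert a[start_a:end_a] == sub_a
    assert b[start_b:end_b] == sub_b

# The (4, 6) from the question does not round-trip the same way.
print(repr(a[4:6]))  # 'ar'
```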
You can use this code:
a = "how are you ?"
a_list = a.split()
b = "wie gehst's es dir?"
b_list = b.split()

# This way you can get the word easily by using its position in the sentence
a_values = dict(zip(range(len(a_list)), a_list))  # {index: word}
b_values = dict(zip(range(len(b_list)), b_list))

# Helps you find the position in the sentence by word.
a_keys = dict(zip(a_list, range(len(a_list))))  # {word: index}
b_keys = dict(zip(b_list, range(len(b_list))))

target = 'b'  # This corresponds to the sentence. As in the example it is either sentence a or b
target_values = b_values if target == 'b' else a_values
target_keys = b_keys if target == 'b' else a_keys

# Give your arbitrary word to x_keys
for value in target_values.values():
    index = target_keys[value]  # Find the index of the word
    start = 0  # Calculate its offset
    for i in range(index):
        start += len(target_values[i])
        start += 1  # Whitespace between words
    end = start + len(target_values[index])
    print(f'{value}: ({start}, {end})')
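One thing to watch: the {word: index} dictionaries keep only the last position of a repeated word, and the offset arithmetic assumes exactly one space between words. If either matters, a possible alternative is to take the word spans directly from re.finditer. This is only a sketch, and word_spans is a made-up helper name:

```python
import re


def word_spans(text):
    """Hypothetical helper: (word, (start, end)) for each whitespace-separated word."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


spans = word_spans("how are you ? are you okay ?")
for word, span in spans:
    print(word, span)
# how (0, 3)
# are (4, 7)
# you (8, 11)
# ? (12, 13)
# are (14, 17)
# you (18, 21)
# okay (22, 26)
# ? (27, 28)
```

Each occurrence of a repeated word gets its own end-exclusive span, so the spans can be matched against the pairs in c in order.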
Upvotes: 0
Reputation: 107015
As an alternative you can also enclose the substrings in capture groups of a regex pattern so that you can use the start and end methods of a Match object to identify the offset of each substring in a given text. This would be especially useful in the event you'd like to search for a pattern rather than strictly a substring:
import re
from itertools import chain

offsets = []
for text, substrings in zip((a, b), zip(*c)):
    match = re.search('.*?'.join(map('({})'.format, substrings)), text)
    offsets.append([
        (substring, (match.start(group), match.end(group) - 1))
        for group, substring in enumerate(match.groups(), 1)
    ])
offsets = [tuple(chain.from_iterable(zip(*info))) for info in zip(*offsets)]

offsets becomes:
[('how', 'wie', (0, 2), (0, 2)),
('are', "gehst's", (4, 6), (4, 10)),
('you', 'es', (8, 10), (12, 13)),
('are', 'geht', (14, 16), (21, 24)),
('you', 'es', (18, 20), (26, 27)),
('okay', 'gut', (22, 25), (33, 35))]
Demo here
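If the substrings are meant literally rather than as patterns, characters such as ? would need escaping, and re.search returns None when the terms don't all occur in order. A minimal sketch of that variant, assuming end-exclusive spans are acceptable (regex_offsets is a made-up name):

```python
import re


def regex_offsets(text, substrings):
    """Hypothetical helper: end-exclusive spans of literal substrings, in order."""
    pattern = ".*?".join(f"({re.escape(s)})" for s in substrings)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None  # the substrings do not all occur in this order
    return [(s, match.span(i)) for i, s in enumerate(substrings, 1)]


a = "how are you ? are you okay ?"
print(regex_offsets(a, ["how", "are", "you", "?", "are", "you", "okay"]))
# [('how', (0, 3)), ('are', (4, 7)), ('you', (8, 11)), ('?', (12, 13)),
#  ('are', (14, 17)), ('you', (18, 21)), ('okay', (22, 26))]
print(regex_offsets(a, ["okay", "how"]))  # None
```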
Upvotes: 1
Reputation: 782158
str.find() takes optional start and end arguments to restrict where it searches for the substring. So you can use the previous end_a as the start argument to the next a.find().
offsets = []
end_a = end_b = 0
for substring_a, substring_b in c:
    # Find the start and end indices for substring_a in string a
    start_a = a.find(substring_a, end_a)
    end_a = start_a + len(substring_a)
    # Find the start and end indices for substring_b in string b
    start_b = b.find(substring_b, end_b)
    end_b = start_b + len(substring_b)
    # Append the result as a tuple
    offsets.append((substring_a, substring_b, (start_a, end_a - 1), (start_b, end_b - 1)))
Results:
[('how', 'wie', (0, 2), (0, 2)),
('are', "gehst's", (4, 6), (4, 10)),
('you', 'es', (8, 10), (12, 13)),
('are', 'geht', (14, 16), (21, 24)),
('you', 'es', (18, 20), (26, 27)),
('okay', 'gut', (22, 25), (33, 35))]
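One caveat: find() returns -1 when a term is missing, so the loop would quietly record an offset starting at -1 and carry on. If you would rather fail loudly, str.index() accepts the same start argument but raises ValueError. A sketch of that variant follows; strict_offsets and the ("today", "heute") pair are made up for illustration:

```python
def strict_offsets(a, b, pairs):
    """Hypothetical strict variant: inclusive end offsets, raises ValueError on a miss."""
    offsets = []
    end_a = end_b = 0
    for sub_a, sub_b in pairs:
        start_a = a.index(sub_a, end_a)  # like find(), but raises if absent
        end_a = start_a + len(sub_a)
        start_b = b.index(sub_b, end_b)
        end_b = start_b + len(sub_b)
        offsets.append((sub_a, sub_b, (start_a, end_a - 1), (start_b, end_b - 1)))
    return offsets


a = "how are you ?"
b = "wie gehst's es dir?"

print(strict_offsets(a, b, [("how", "wie"), ("are", "gehst's")]))
# [('how', 'wie', (0, 2), (0, 2)), ('are', "gehst's", (4, 6), (4, 10))]

try:
    strict_offsets(a, b, [("how", "wie"), ("today", "heute")])
except ValueError as exc:
    print("not found:", exc)
```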
Upvotes: 1