Reputation: 122042
How can I find the position of the list of substrings from the string?
Given a string:
"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
And a list of substring:
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
Desired output:
>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
(34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
(68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
(110, 119), (120, 122), (123, 131), (131, 132)]
Explanation of the output, the first substring "The" can be found using the (start, end)
index by using the string s
. So from the desired output.
So if we loop through all the tuples of integers from the desired output we'll get back the list of substrings, i.e.
>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
I've tried:
def find_offset(tokens, s):
index = 0
offsets = []
for token in tokens:
start = s[index:].index(token) + index
index = start + len(token)
offsets.append((start, index))
return offsets
Is there another way to find the position of the list of substrings from the string?
Upvotes: 5
Views: 1651
Reputation: 21
def spans2(text, fragments):
result = []
for fragment in fragments:
found_start = text.index(fragment)
found_end = found_start + len(fragment)
result.append((found_start, found_end))
return(result)
Test:
txt = "foo man bit"
frag = ["bit","man"]
spans2(txt,frag)
[(8, 11), (4, 7)]
Upvotes: 0
Reputation: 2047
import re
s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
for token in tokens:
pattern = re.compile(re.escape(token))
print(pattern.search(s).span())
RESULT
(0, 3)
(4, 9)
(9, 10)
(11, 16)
(17, 20)
(21, 23)
(24, 34)
(9, 10)
(36, 43)
(44, 46)
(47, 52)
(52, 54)
(55, 60)
(61, 67)
(68, 72)
(73, 75)
(76, 83)
(84, 89)
(90, 98)
(99, 103)
(104, 109)
(110, 119)
(120, 122)
(123, 131)
(131, 132)
Upvotes: 3
Reputation: 19947
First solution:
#use list comprehension and list.index function.
[tuple((s.index(e),s.index(e)+len(e))) for e in t]
Second solution to correct the issues in the first solution:
def find_offsets(tokens, s):
tid = [list(e) for e in tokens]
i = 0
for id_token,token in enumerate(tid):
while (token[0]!=s[i]):
i+=1
tid[id_token] = tuple((i,i+len(token)))
i+=len(token)
return tid
find_offsets(tokens, s)
Out[201]:
[(0, 3),
(4, 9),
(9, 10),
(11, 16),
(17, 20),
(21, 23),
(24, 34),
(34, 35),
(36, 43),
(44, 46),
(47, 52),
(52, 54),
(55, 60),
(61, 67),
(68, 72),
(73, 75),
(76, 83),
(84, 89),
(90, 98),
(99, 103),
(104, 109),
(110, 119),
(120, 122),
(123, 131),
(131, 132)]
#another test
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]
Upvotes: 5
Reputation: 40894
If we have no idea about the substrings, there's no way except to scan the whole text anew for each of them.
If, as it seems from data, we know that these are sequential fragments of the text, given in the textual order, it's easy to only scan the rest of the text after each match. There's no point to cut the text every time, though.
def spans(text, fragments):
result = []
point = 0 # Where we're in the text.
for fragment in fragments:
found_start = text.index(fragment, point)
found_end = found_start + len(fragment)
result.append((found_start, found_end))
point = found_end
return result
Test:
>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]
This assumes that every fragment is present in the text at the right place. Your output format does not provide an example of mismatch reporting. Using .find
instead of .index
could help that, though only partly.
Upvotes: 1