alvas
alvas

Reputation: 122042

How can I find the position of the list of substrings from the string?

How can I find the position of the list of substrings from the string?

Given a string:

"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."

And a list of substring:

['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

Desired output:

>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
        (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
        (68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
        (110, 119), (120, 122), (123, 131), (131, 132)]

Explanation of the output, the first substring "The" can be found using the (start, end) index by using the string s. So from the desired output.

So if we loop through all the tuples of integers from the desired output we'll get back the list of substrings, i.e.

>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

I've tried:

def find_offset(tokens, s):
    index = 0
    offsets = []
    for token in tokens:
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets

Is there another way to find the position of the list of substrings from the string?

Upvotes: 5

Views: 1651

Answers (4)

curious_bitcoiner
curious_bitcoiner

Reputation: 21

def spans2(text, fragments):
    result = []
    for fragment in fragments:
        found_start = text.index(fragment)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
    return(result)

Test:

txt = "foo man bit"
frag = ["bit","man"]

spans2(txt,frag)

[(8, 11), (4, 7)]

Upvotes: 0

Fuji Komalan
Fuji Komalan

Reputation: 2047

import re

s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']


for token in tokens:
  pattern = re.compile(re.escape(token))
  print(pattern.search(s).span())

RESULT

(0, 3)
(4, 9)
(9, 10)
(11, 16)
(17, 20)
(21, 23)
(24, 34)
(9, 10)
(36, 43)
(44, 46)
(47, 52)
(52, 54)
(55, 60)
(61, 67)
(68, 72)
(73, 75)
(76, 83)
(84, 89)
(90, 98)
(99, 103)
(104, 109)
(110, 119)
(120, 122)
(123, 131)
(131, 132)

Upvotes: 3

Allen Qin
Allen Qin

Reputation: 19947

First solution:

#use list comprehension and list.index function.
[tuple((s.index(e),s.index(e)+len(e))) for e in t]

Second solution to correct the issues in the first solution:

def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0
    for id_token,token in enumerate(tid):
        while (token[0]!=s[i]):            
            i+=1
        tid[id_token] = tuple((i,i+len(token)))
        i+=len(token)

    return tid


find_offsets(tokens, s)
Out[201]: 
[(0, 3),
 (4, 9),
 (9, 10),
 (11, 16),
 (17, 20),
 (21, 23),
 (24, 34),
 (34, 35),
 (36, 43),
 (44, 46),
 (47, 52),
 (52, 54),
 (55, 60),
 (61, 67),
 (68, 72),
 (73, 75),
 (76, 83),
 (84, 89),
 (90, 98),
 (99, 103),
 (104, 109),
 (110, 119),
 (120, 122),
 (123, 131),
 (131, 132)]   

#another test
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]

Upvotes: 5

9000
9000

Reputation: 40894

If we have no idea about the substrings, there's no way except to scan the whole text anew for each of them.

If, as it seems from data, we know that these are sequential fragments of the text, given in the textual order, it's easy to only scan the rest of the text after each match. There's no point to cut the text every time, though.

def spans(text, fragments):
    result = []
    point = 0  # Where we're in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

Test:

>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]

This assumes that every fragment is present in the text at the right place. Your output format does not provide an example of mismatch reporting. Using .find instead of .index could help that, though only partly.

Upvotes: 1

Related Questions