GDev

Reputation: 33

Get character index from word index in a text

Given the index of a word in a text, I need the index of that word's first character. For example, in the text below:

"The cat called other cats."

The index of the word "cat" is 1. I need the index of its first character, c, which is 4. I don't know if this is relevant, but I am using python-nltk to get the words. Right now, the only way I can think of doing this is:

 - Get the first character, find the number of words in this piece of text
 - Get the first two characters, find the number of words in this piece of text
 - Get the first three characters, find the number of words in this piece of text
 - Repeat, adding one character each time, until we reach the required word

But this would be very inefficient. Any ideas would be appreciated.

Upvotes: 1

Views: 1950

Answers (3)

TerryA

Reputation: 59974

Use enumerate()

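>>> def obt(phrase, indx):
...     word = phrase.split()[indx]       # the word at the requested index
...     e = list(enumerate(phrase))       # (character index, character) pairs
...     for i, j in e:
...         # compare the slice starting at i against the whole word
...         # note: this returns the first place the word's text occurs,
...         # even if that is inside an earlier, longer word
...         if j == word[0] and ''.join(x for y, x in e[i:i+len(word)]) == word:
...             return i
...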
>>> obt("The cat called other cats.", 1)
4
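
As a variation on the same idea, here is a minimal sketch that still uses enumerate() but searches forward from the end of the previous word, so a word whose text also appears inside an earlier word is not matched too early. The name word_start is hypothetical, and it assumes words are whatever str.split() returns (whitespace-separated, punctuation attached):

```python
def word_start(text, word_index):
    """Return the character offset of the whitespace-separated word at word_index."""
    pos = 0
    for i, word in enumerate(text.split()):
        # Search only from the end of the previous word onward.
        start = text.index(word, pos)
        if i == word_index:
            return start
        pos = start + len(word)
    raise IndexError("word_index out of range")
```

With this sketch, word_start("The cat called other cats.", 1) gives 4, and word_start("concatenate cat", 1) gives 12 rather than 3.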

Upvotes: 0

Ashwini Chaudhary

Reputation: 250971

You can use a dict here:

>>> import re
>>> r = re.compile(r'\w+')
>>> text = "The cat called other cats."
>>> dic = { i :(m.start(0), m.group(0)) for i, m in enumerate(r.finditer(text))}
>>> dic
{0: (0, 'The'), 1: (4, 'cat'), 2: (8, 'called'), 3: (15, 'other'), 4: (21, 'cats')}
>>> def char_index(char, word_ind):
...     start, word = dic[word_ind]
...     ind = word.find(char)
...     if ind != -1:
...         return start + ind
...
>>> char_index('c',1)
4
>>> char_index('c',2)
8
>>> char_index('c',3)    # 'other' contains no 'c', so this returns None
>>> char_index('c',4)
21
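
If only the start of one word is needed, the dict can be skipped. The following is a rough sketch, assuming the same \w+ definition of a word as above; the name nth_word_start is made up for illustration. It stops scanning as soon as the requested match is reached:

```python
import re
from itertools import islice

WORD = re.compile(r'\w+')  # same word pattern as in the dict example

def nth_word_start(text, word_index):
    """Return the start offset of the word_index-th \\w+ match, or None if there is none."""
    # islice skips the first word_index matches lazily; next() takes the following one.
    match = next(islice(WORD.finditer(text), word_index, None), None)
    return match.start() if match else None
```

For the example text, nth_word_start(text, 1) returns 4 and nth_word_start(text, 4) returns 21.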

Upvotes: 1

user2134086

Reputation:

import re
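def char_index(sentence, word_index):
    # The capturing group keeps the whitespace runs in the result, so words sit
    # at even positions (assumes the sentence does not start with whitespace)
    parts = re.split(r'(\s+)', sentence)
    return len(''.join(parts[:word_index*2]))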

>>> s = 'The die has been cast'
>>> char_index(s,3)    #'been' has index 3 in the list of words
12
>>> s[12]
'b'
>>> 
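
When many lookups are needed on the same sentence, the same split can be used to compute every word's start in one pass with running totals. This is only a sketch under the same whitespace-splitting assumption; word_starts is a hypothetical name. Unlike the version above, it also tolerates leading whitespace:

```python
import re
from itertools import accumulate

def word_starts(sentence):
    """Return the start offset of every whitespace-separated word in sentence."""
    pieces = re.split(r'(\s+)', sentence)  # words and whitespace runs, in order
    # offsets[i] is the position where pieces[i] begins in the sentence.
    offsets = [0] + list(accumulate(len(p) for p in pieces))
    # Keep only the offsets of real words (skip empty strings and whitespace runs).
    return [off for piece, off in zip(pieces, offsets)
            if piece and not piece.isspace()]
```

For 'The die has been cast', word_starts(s)[3] is 12, matching the result above.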

Upvotes: 0
