Reputation: 33
Given the index of a word in a text, I need to get the character index. For example, in the text below:
"The cat called other cats."
The index of the word "cat" is 1. I need the index of its first character, i.e. "c", which is 4. I don't know if this is relevant, but I am using python-nltk to get the words. Right now the only way I can think of doing this is:
- Get the first character, find the number of words in this piece of text
- Get the first two characters, find the number of words in this piece of text
- Get the first three characters, find the number of words in this piece of text
Repeat until we get to the required word.
But this will be very inefficient. Any ideas will be appreciated.
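For reference, the prefix-counting idea above could be sketched roughly like this. It assumes plain `str.split()` as a stand-in for the NLTK tokenizer, and the function name `char_index_by_prefixes` is made up for illustration. Every prefix is re-tokenized from scratch, which is why the cost grows quadratically with the text length.

```python
def char_index_by_prefixes(text, word_index):
    """Grow the prefix one character at a time and count its words.

    Assumption: str.split() stands in for the real tokenizer.
    """
    for end in range(1, len(text) + 1):
        # The prefix first contains word_index + 1 words at the moment
        # the first character of the wanted word is included.
        if len(text[:end].split()) == word_index + 1:
            return end - 1
    raise IndexError("word index out of range")


print(char_index_by_prefixes("The cat called other cats.", 1))  # 4
```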
Upvotes: 1
Views: 1950
Reputation: 59974
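Use enumerate():
>>> def obt(phrase, indx):
...     word = phrase.split()[indx]
...     e = list(enumerate(phrase))
...     for i, j in e:
...         # Compare the characters starting at i with the target word.
...         # Note: this returns the first place the word's text occurs.
...         if j == word[0] and ''.join(x for y, x in e[i:i+len(word)]) == word:
...             return i
...
>>> obt("The cat called other cats.", 1)
4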
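One caveat: the text search matches the first occurrence of the word's characters, so for a phrase like "cats cat" and index 1 it would report 0 rather than 5. A possible variation, offered only as a sketch, walks the words in order and resumes each search where the previous word ended. The name `word_start` is hypothetical, and words are assumed to be whitespace-separated as in `phrase.split()`.

```python
def word_start(phrase, word_index):
    """Return the offset of the word_index-th whitespace-separated word."""
    pos = 0
    for i, word in enumerate(phrase.split()):
        # Search from the end of the previous word, so an earlier
        # occurrence of the same text is never matched by mistake.
        pos = phrase.index(word, pos)
        if i == word_index:
            return pos
        pos += len(word)
    raise IndexError("word index out of range")


print(word_start("The cat called other cats.", 1))  # 4
print(word_start("cats cat", 1))                    # 5
```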
Upvotes: 0
Reputation: 250971
You can use a dict here:
>>> import re
>>> r = re.compile(r'\w+')
>>> text = "The cat called other cats."
>>> dic = { i :(m.start(0), m.group(0)) for i, m in enumerate(r.finditer(text))}
>>> dic
{0: (0, 'The'), 1: (4, 'cat'), 2: (8, 'called'), 3: (15, 'other'), 4: (21, 'cats')}
>>> def char_index(char, word_ind):
...     start, word = dic[word_ind]
...     ind = word.find(char)
...     if ind != -1:
...         return start + ind
...
>>> char_index('c',1)
4
>>> char_index('c',2)
8
>>> char_index('c',3)
>>> char_index('c',4)
21
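If only the offset of each word's first character is needed, a plain list of match starts may be enough. This is a minimal sketch using the same `\w+` pattern; the helper name `word_starts` is an assumption, and `\w+` will split a token like "don't" in two, so the pattern may need adjusting to agree with whatever tokenizer produced the word indices.

```python
import re


def word_starts(text):
    """Return the start offset of every run of word characters in text."""
    return [m.start() for m in re.finditer(r'\w+', text)]


text = "The cat called other cats."
starts = word_starts(text)
print(starts)     # [0, 4, 8, 15, 21]
print(starts[1])  # 4, the offset of 'c' in "cat"
```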
Upvotes: 1
Reputation:
import re
def char_index(sentence, word_index):
    # Parentheses keep the whitespace separators in the result;
    # \s+ treats a run of spaces as a single separator
    parts = re.split(r'(\s+)', sentence)
    # Words sit at even positions, separators at odd positions
    return len(''.join(parts[:word_index*2]))
>>> s = 'The die has been cast'
>>> char_index(s,3) #'been' has index 3 in the list of words
12
>>> s[12]
'b'
Upvotes: 0