Reputation: 1013
I have a Python string and a substring of selected text. For example, the string could be
stringy = "the bee buzzed loudly"
I want to select the text "bee buzzed" within this string. I have the character offsets, i.e. 4-14 for this particular string, because those are the character-level indices that the selected text lies between.
What is the simplest way to convert these to word-level indices, i.e. 1-2, because the second and third words are being selected? I have many strings that are labeled like this and I would like to convert the indices simply and efficiently. The data is currently stored in a dictionary like so:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
I would like to convert it to this form
data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}
Thank you!
Upvotes: 1
Views: 866
Reputation: 924
Try this code, please:
def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char
    dic[arg[1]] = end_char
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))
char_change(data, start_char, end_char, "start_char", "end_char")
print(data)
Default Dictionary:
data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}
INPUT
Please enter your start character: 1
Please enter your end character: 2
OUTPUT Dictionary:
{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}
Upvotes: -1
Reputation: 2607
Here's a simple list index approach:
# set up data
string = "the bee buzzed loudly"
words = string[4:14].split(" ") #get words from string using the character indices
stringLst = string.split(" ") #split string into words
dictionary = {"string":"", "start_word":0,"end_word":0}
#process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0]) #index of the first word in words
dictionary["end_word"] = stringLst.index(words[-1]) #index of the last
print(dictionary)
{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
Take note that this assumes the selected words appear in order inside the string and that each of them occurs only once, because list.index returns the position of the first match.
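If words can repeat, one way around the index lookup is to count the words that come before each offset. The following is only a minimal sketch: the function name char_to_word_span is made up for illustration, it assumes words are separated by whitespace, that the offsets fall on word boundaries, and that end_char is exclusive (as in the string[4:14] slice above).

```python
def char_to_word_span(text, start_char, end_char):
    """Map character offsets to word indices (hypothetical helper).

    Assumes whitespace-separated words, offsets aligned to word
    boundaries, and an exclusive end_char.
    """
    # Number of complete words before the selection = index of its first word.
    start_word = len(text[:start_char].split())
    # Number of words up to the end of the selection, minus one = index of its last word.
    end_word = len(text[:end_char].split()) - 1
    return start_word, end_word


print(char_to_word_span("the bee buzzed loudly", 4, 14))
```

Because it never searches for the word text itself, a string such as "the bee the bee" with offsets 8-15 should resolve to the second occurrence rather than the first.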
Upvotes: 2
Reputation: 770
It seems like a tokenisation problem. My solution would be to use a span tokenizer and then search for your substring's span among the token spans. So, using the nltk library:
import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b, sub_e = 4, 14 # substring begin and end
[i for i, (b, e) in enumerate(tokenizer.span_tokenize(stringy))
 if b >= sub_b and e <= sub_e]
But this is kind of intricate.
tokenizer.span_tokenize(stringy)
returns the (begin, end) character span of each token/word it identified.
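The same span-matching idea can be sketched without nltk and applied directly to the dictionary from the question. This is an illustration under assumptions, not a drop-in replacement for the tokenizer above: the helper name to_word_indices is hypothetical, tokens are taken to be runs of non-whitespace characters (roughly what a whitespace tokenizer would give), and end_char is treated as exclusive.

```python
import re


def to_word_indices(record):
    """Convert a start_char/end_char record to start_word/end_word (hypothetical helper)."""
    text = record["string"]
    # (begin, end) character span of every whitespace-delimited token.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    # Keep the indices of tokens that lie entirely inside the selection.
    selected = [
        i for i, (b, e) in enumerate(spans)
        if b >= record["start_char"] and e <= record["end_char"]
    ]
    if not selected:
        raise ValueError("selection does not cover any complete word")
    return {"string": text, "start_word": selected[0], "end_word": selected[-1]}


data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
print(to_word_indices(data))
```

Raising an error when no complete word is covered is a design choice made for this sketch; returning None or snapping to the nearest word would be equally reasonable depending on how the labels were produced.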
Upvotes: 2