Reputation: 1013

Simplest way to convert char offsets to word offsets

I have a Python string and a substring of selected text. For example, the string could be:

stringy = "the bee buzzed loudly"

I want to select the text "bee buzzed" within this string. I have the character offsets, i.e. 4-14 for this particular string, because those are the character-level indices that the selected text lies between.

What is the simplest way to convert these to word-level indices, i.e. 1-2, because the second and third words are being selected? I have many strings that are labeled like this, and I would like to convert the indices simply and efficiently. The data is currently stored in a dictionary like so:

data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}

I would like to convert it to this form:

data = {"string":"the bee buzzed loudly","start_word":1,"end_word":2}

Thank you!

Upvotes: 1

Answers (3)

Ahmed

Reputation: 924

Try this code, please:

def char_change(dic, start_char, end_char, *arg):
    dic[arg[0]] = start_char
    dic[arg[1]] = end_char


data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}

start_char = int(input("Please enter your start character: "))
end_char = int(input("Please enter your end character: "))

char_change(data, start_char, end_char, "start_char", "end_char")

print(data)

Default Dictionary:

data = {"string":"the bee buzzed loudly","start_char":4,"end_char":14}

INPUT

Please enter your start character: 1
Please enter your end character: 2

OUTPUT Dictionary:

{'string': 'the bee buzzed loudly', 'start_char': 1, 'end_char': 2}

Upvotes: -1

Ironkey

Reputation: 2607

Here's a simple list index approach:

# set up data
string  = "the bee buzzed loudly"
words = string[4:14].split(" ") #get words from string using the character indices
stringLst = string.split(" ") #split string into words
dictionary = {"string":"", "start_word":0,"end_word":0}


#process
dictionary["string"] = string
dictionary["start_word"] = stringLst.index(words[0]) #index of the first word in words
dictionary["end_word"] = stringLst.index(words[-1]) #index of the last
print(dictionary)

{'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}

Take note that this assumes each selected word first appears in the string at the selected position, since list.index returns the first match.
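
A minimal sketch that avoids looking words up by value: count the whitespace-separated words that precede each character offset. It assumes the offsets fall on word boundaries and that end_char is exclusive (as in a Python slice); char_to_word_offsets is a hypothetical helper name, not something from the question.

```python
def char_to_word_offsets(text, start_char, end_char):
    """Convert a [start_char, end_char) character span to inclusive word indices."""
    # The number of words before the selection is the index of the first selected word.
    start_word = len(text[:start_char].split())
    # The number of words up to the end of the selection, minus one, is the last selected word.
    end_word = len(text[:end_char].split()) - 1
    return start_word, end_word


data = {"string": "the bee buzzed loudly", "start_char": 4, "end_char": 14}
start_word, end_word = char_to_word_offsets(
    data["string"], data["start_char"], data["end_char"]
)
converted = {"string": data["string"], "start_word": start_word, "end_word": end_word}
print(converted)
# {'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
```

Because it works from positions rather than word values, a string such as "the bee the bee" with offsets 8-15 should give word indices 2-3 instead of 0-1.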

Upvotes: 2

ygorg

Reputation: 770

This seems like a tokenisation problem. My solution would be to use a span tokenizer and then search for the token spans that fall inside your substring's span. So, using the nltk library:

import nltk
tokenizer = nltk.tokenize.TreebankWordTokenizer()
# or tokenizer = nltk.tokenize.WhitespaceTokenizer()
stringy = 'the bee buzzed loudly'
sub_b, sub_e = 4, 14  # substring begin and end
[i for i, (b, e) in enumerate(tokenizer.span_tokenize(stringy))
 if b >= sub_b and e <= sub_e]

But this is kind of intricate. tokenizer.span_tokenize(stringy) returns spans for each token/word it identified.
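
If plain whitespace tokenisation is enough, the same span comparison can probably be sketched with the standard library only. Here re.finditer with the pattern \S+ stands in for the span tokenizer, and words_in_span is a hypothetical helper name. The first and last matching indices then give the start_word and end_word asked for in the question, assuming the selection covers at least one whole word.

```python
import re


def words_in_span(text, sub_b, sub_e):
    """Return indices of whitespace-delimited words lying fully inside [sub_b, sub_e]."""
    # Each match of \S+ is one word; m.start() and m.end() are its character span.
    return [
        i
        for i, m in enumerate(re.finditer(r"\S+", text))
        if m.start() >= sub_b and m.end() <= sub_e
    ]


stringy = "the bee buzzed loudly"
indices = words_in_span(stringy, 4, 14)
data = {"string": stringy, "start_word": indices[0], "end_word": indices[-1]}
print(data)
# {'string': 'the bee buzzed loudly', 'start_word': 1, 'end_word': 2}
```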

Upvotes: 2
