had

Reputation: 327

How to extract input-to-output position indices from a Hugging Face transformers tokenizer?

I want to solve a stress prediction task with a pretrained Russian BERT.

Input data looks like this:

граммов сверху|000100000001000

Zeros mean no stress. A one marks the position of the stressed character.

I want to map each word to the number of its stressed vowel (word -> vowel number index).

So the result would look like: граммов -> 1, сверху -> 1

So, for each token, the model should have a linear layer with softmax on top.

I understand this part, but the text preprocessing is hard for me because the tokenizer can split a word into subword tokens.

The tokenizer takes an input string and returns token ids like this:

bert_tokenizer.encode('граммов сверху')
->
[101, 44505, 26656, 102]

So, how can I get a position mapping between input characters and words?

The desired output should look like [[0, 7], [8, 14]]

I read the docs at https://huggingface.co/transformers/main_classes/tokenizer.html

and found the convert_ids_to_tokens function. It works like this:

encoded = bert_tokenizer.encode('граммов сверху')
bert_tokenizer.convert_ids_to_tokens(encoded)
->
['[CLS]', 'граммов', 'сверху', '[SEP]']

But I'm not sure how to use the original string and the stress indices to calculate the stressed vowel number for each token.

Upvotes: 3

Views: 1176

Answers (1)

had

Reputation: 327

It turned out that the tokenizer has a return_offsets_mapping parameter, which solved my problem.
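For anyone with the same task, here is a minimal sketch of how the offsets could be combined with the stress mask. It rests on several assumptions: the helper name stressed_vowel_per_span is hypothetical, the mask is assumed to be aligned so that stress_mask[i] refers to text[i] (the sample mask in the question looks one entry longer than the text, so it may need shifting first), and vowels are numbered from 1 with 0 meaning "no stress in this span". The offsets list is hard-coded in the shape a fast tokenizer returns for this input, with (0, 0) for special tokens. As far as I know, return_offsets_mapping only works with the fast (Rust-backed) tokenizers.

```python
# Sketch only: labels each token span with the 1-based number of its stressed vowel.
# With a real tokenizer, the offsets would come from something like:
#   enc = bert_tokenizer(text, return_offsets_mapping=True)
#   offsets = enc["offset_mapping"]

VOWELS = set("аеёиоуыэюя")  # Russian vowels, lowercase


def stressed_vowel_per_span(text, stress_mask, offsets):
    """Return one label per (start, end) span: the 1-based number of the
    stressed vowel inside the span, or 0 if the span has no stressed vowel."""
    labels = []
    for start, end in offsets:
        label = 0
        vowel_no = 0
        # Empty spans such as (0, 0) for [CLS]/[SEP] skip the loop and get 0.
        for i in range(start, end):
            if text[i].lower() in VOWELS:
                vowel_no += 1
                if stress_mask[i] == "1":
                    label = vowel_no
        labels.append(label)
    return labels


text = "граммов сверху"
# Assumption: hypothetical mask aligned one-to-one with the characters of text.
stress_mask = "00100000001000"
# Offsets for [CLS], граммов, сверху, [SEP]; end index is exclusive.
offsets = [(0, 0), (0, 7), (8, 14), (0, 0)]

print(stressed_vowel_per_span(text, stress_mask, offsets))
```

Note that this counts vowels per token span. If a word gets split into several subword tokens, the vowel counter restarts in each piece, so the spans would have to be merged per word first (for example by grouping tokens with the encoding's word_ids()) to get a word-level vowel number.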

Upvotes: 2
