Reputation: 327
I want to solve a stress prediction task with a pretrained Russian BERT.
Input data looks like this:
граммов сверху|000100000001000
Zeros mean no stress; ones mark the stress position.
I want to map this to word -> number of the stressed vowel.
So it would be: граммов -> 1, сверху -> 1
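A minimal sketch of that word -> vowel number mapping, assuming the character spans of the words are already known. It also assumes, judging only from the example above (the mask is one character longer than the text and the ones sit right after the vowels), that a 1 at mask position p marks the character at p - 1 as stressed. The function name and the vowel set are illustrative, not part of any library.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюя")

def stressed_vowel_numbers(text, mask, word_spans):
    """Return the 1-based number of the stressed vowel per word (0 if unmarked)."""
    labels = []
    for start, end in word_spans:
        label = 0
        # Assumption: a '1' at mask position p means text[p - 1] is stressed,
        # so the marks belonging to this word sit at positions start + 1 .. end.
        for p in range(start + 1, min(end, len(mask) - 1) + 1):
            if mask[p] == "1":
                # Count the vowels of this word up to and including the stressed one.
                label = sum(ch in RUSSIAN_VOWELS for ch in text[start:p].lower())
                break
        labels.append(label)
    return labels

text, mask = "граммов сверху|000100000001000".split("|")
print(stressed_vowel_numbers(text, mask, [(0, 7), (8, 14)]))  # [1, 1]
```

If the mask convention in the real data differs (for example, the 1 sits on the vowel itself), the range and the slice need to shift by one.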
So for each token there should be a linear layer with softmax.
I understand this part, but I'm struggling with the text preprocessing, because the tokenizer can split a word into subword tokens.
The tokenizer takes an input string and returns token ids like this:
bert_tokenizer.encode('граммов сверху')
->
[101, 44505, 26656, 102]
So, how do I get the position mapping between input characters and words?
The desired output should be like [[0, 7], [8, 14]]
I tried reading the docs: https://huggingface.co/transformers/main_classes/tokenizer.html
and found the convert_ids_to_tokens function. It works like this:
encoded = bert_tokenizer.encode('граммов сверху')
bert_tokenizer.convert_ids_to_tokens(encoded)
->
['[CLS]', 'граммов', 'сверху', '[SEP]']
But I'm not sure how to use the original string and the stress indices to calculate the stress position number for each token.
Upvotes: 3
Views: 1176
Reputation: 327
It turned out that the tokenizer has a return_offsets_mapping parameter, which solves my problem.
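A rough sketch of how those offsets could be turned into the word spans from the question. As far as I know, return_offsets_mapping is only supported by the fast (Rust-backed) tokenizers, and special tokens such as [CLS] and [SEP] typically come back as (0, 0). The helper name word_spans_from_offsets is made up for illustration, and the offsets below are written by hand rather than taken from real tokenizer output.

```python
def word_spans_from_offsets(offsets):
    """Merge sub-token character offsets into per-word [start, end) spans."""
    spans = []
    for start, end in offsets:
        if start == end:
            continue  # skip empty spans, e.g. special tokens reported as (0, 0)
        if spans and spans[-1][1] == start:
            spans[-1][1] = end  # touches the previous token: same word continues
        else:
            spans.append([start, end])  # gap before this token: a new word starts
    return spans

# With a real fast tokenizer the offsets would come from something like:
#   enc = bert_tokenizer("граммов сверху", return_offsets_mapping=True)
#   offsets = enc["offset_mapping"]
# Hand-written offsets (assumed, not actual tokenizer output):
whole_words = [(0, 0), (0, 7), (8, 14), (0, 0)]
split_word = [(0, 0), (0, 4), (4, 7), (8, 14), (0, 0)]  # hypothetical subword split

print(word_spans_from_offsets(whole_words))  # [[0, 7], [8, 14]]
print(word_spans_from_offsets(split_word))   # [[0, 7], [8, 14]]
```

Merging on touching offsets also glues punctuation to the word it follows; if that matters, grouping tokens by the word_ids() of the returned encoding is probably a cleaner option.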
Upvotes: 2