How to know if a word belongs to a Transformer model?

I use the Python library sentence_transformers with the RoBERTa and FlauBERT models. I use cosine scores to compute similarity, but for some words it doesn't work well. Those words seem to be the ones that are not part of the model's "known" words (words that weren't in the training set, I guess), such as "WCFs", "SARs", and "OSGi".

Is there a way to check whether a string is "known" by a model (with this library or any other one able to load those Transformer models)?

Thanks a lot.

Upvotes: 2

Answers (1)

Victor Maricato

Reputation: 812

For RoBERTa and FlauBERT models, you can use the tokenizer's get_vocab() method to get a dictionary mapping tokens to their ids. For example, the first 100 tokens in the vocabulary:

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100]

Yields:

['<s>',
 '<pad>',
 '</s>',
 '<unk>',
 '.',
 'Ġthe',
 ',',
 'Ġto',
 'Ġand',
 'Ġof',
 'Ġa',
 'Ġin',
 '-',
 'Ġfor',
 'Ġthat',
 'Ġon',
 'Ġis',
 'âĢ',
 "'s",
 'Ġwith',
 'ĠThe',
 'Ġwas',
 'Ġ"',
 'Ġat',
 'Ġit',
 'Ġas',
 'Ġsaid',
 'Ļ',
 'Ġbe',
 's',
 'Ġby',
 'Ġfrom',
 'Ġare',
 'Ġhave',
 'Ġhas',
 ':',
 'Ġ(',
 'Ġhe',
 'ĠI',
 'Ġhis',
 'Ġwill',
 'Ġan',
 'Ġthis',
 ')',
 'ĠâĢ',
 'Ġnot',
 'Ŀ',
 'Ġyou',
 'ľ',
 'Ġtheir',
 'Ġor',
 'Ġthey',
 'Ġwe',
 'Ġbut',
 'Ġwho',
 'Ġmore',
 'Ġhad',
 'Ġbeen',
 'Ġwere',
 'Ġabout',
 ',"',
 'Ġwhich',
 'Ġup',
 'Ġits',
 'Ġcan',
 'Ġone',
 'Ġout',
 'Ġalso',
 'Ġ$',
 'Ġher',
 'Ġall',
 'Ġafter',
 '."',
 '/',
 'Ġwould',
 "'t",
 'Ġyear',
 'Ġwhen',
 'Ġfirst',
 'Ġshe',
 'Ġtwo',
 'Ġover',
 'Ġpeople',
 'ĠA',
 'Ġour',
 'ĠIt',
 'Ġtime',
 'Ġthan',
 'Ġinto',
 'Ġthere',
 't',
 'ĠHe',
 'Ġnew',
 'ĠâĢĶ',
 'Ġlast',
 'Ġjust',
 'ĠIn',
 'Ġother',
 'Ġso',
 'Ġwhat']

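Then you can use Python's in operator to check whether a token is in the vocabulary. The check matches the exact token string, including the leading Ġ that marks a preceding space in the tokens listed above:

"token" in vocab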

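Because many RoBERTa tokens carry the Ġ prefix, a bare word can be missing from the vocabulary even though its Ġ-prefixed form is present. Here is a minimal sketch of a check that tries both forms. The helper name is_known_word and the toy vocabulary are hypothetical and only for illustration; other tokenizers may use a different marker, so it is left as a parameter.

```python
# Marker that RoBERTa's byte-level BPE puts in front of tokens that follow a space.
# Assumption: other tokenizers may mark word boundaries differently.
SPACE_MARKER = "Ġ"


def is_known_word(word, vocab, space_marker=SPACE_MARKER):
    """Return True if `word` is a single vocabulary entry,
    either as-is or with the leading-space marker prepended."""
    return word in vocab or (space_marker + word) in vocab


# Toy vocabulary standing in for tokenizer.get_vocab() (ids are made up).
toy_vocab = {"<s>": 0, "Ġthe": 1, "s": 2, "Ġpeople": 3}

print(is_known_word("the", toy_vocab))   # matches "Ġthe"
print(is_known_word("s", toy_vocab))     # matches "s" directly
print(is_known_word("OSGi", toy_vocab))  # not a single entry
```

With the tokenizer loaded above, this could be called as is_known_word("OSGi", tokenizer.get_vocab()). When a word is not a single entry, tokenizer.tokenize("OSGi") shows the subword pieces it is split into.
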
Upvotes: 1
