Reputation: 43
I use the Python library sentence_transformers with the RoBERTa and FlauBERT models. I compute similarity with cosine scores, but for some words it doesn't work well. Those words seem to be the ones that are not among the words "known" to the model (words that weren't in the training set, I guess), such as "WCFs", "SARs", and "OSGi".
Is there a way to check whether a string is "known" by a model (with this library or any other one that can load these Transformer models)?
Thanks a lot.
Upvotes: 2
Views: 1361
Reputation: 812
For RoBERTa and FlauBERT models, you can use the tokenizer's get_vocab() method to get a dictionary that maps each token to its id. For example, to list the first 100 tokens in the vocab:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100]
Yields:
['<s>',
'<pad>',
'</s>',
'<unk>',
'.',
'Ġthe',
',',
'Ġto',
'Ġand',
'Ġof',
'Ġa',
'Ġin',
'-',
'Ġfor',
'Ġthat',
'Ġon',
'Ġis',
'âĢ',
"'s",
'Ġwith',
'ĠThe',
'Ġwas',
'Ġ"',
'Ġat',
'Ġit',
'Ġas',
'Ġsaid',
'Ļ',
'Ġbe',
's',
'Ġby',
'Ġfrom',
'Ġare',
'Ġhave',
'Ġhas',
':',
'Ġ(',
'Ġhe',
'ĠI',
'Ġhis',
'Ġwill',
'Ġan',
'Ġthis',
')',
'ĠâĢ',
'Ġnot',
'Ŀ',
'Ġyou',
'ľ',
'Ġtheir',
'Ġor',
'Ġthey',
'Ġwe',
'Ġbut',
'Ġwho',
'Ġmore',
'Ġhad',
'Ġbeen',
'Ġwere',
'Ġabout',
',"',
'Ġwhich',
'Ġup',
'Ġits',
'Ġcan',
'Ġone',
'Ġout',
'Ġalso',
'Ġ$',
'Ġher',
'Ġall',
'Ġafter',
'."',
'/',
'Ġwould',
"'t",
'Ġyear',
'Ġwhen',
'Ġfirst',
'Ġshe',
'Ġtwo',
'Ġover',
'Ġpeople',
'ĠA',
'Ġour',
'ĠIt',
'Ġtime',
'Ġthan',
'Ġinto',
'Ġthere',
't',
'ĠHe',
'Ġnew',
'ĠâĢĶ',
'Ġlast',
'Ġjust',
'ĠIn',
'Ġother',
'Ġso',
'Ġwhat']
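Then you can use Python's in operator to check whether a token belongs to the vocabulary. Note that RoBERTa marks a token that starts a new word with a leading Ġ (as in the list above), so a word that follows a space is stored as "Ġword" rather than "word":
"token" in vocab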
Upvotes: 1
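An exact vocabulary lookup can be misleading with sub-word tokenizers: a word the model never saw is usually split into several known pieces rather than rejected outright. Below is a minimal sketch of an alternative check, assuming that "known" means "tokenized as a single piece". The helper names split_into_subwords, is_known and demo are hypothetical and not part of transformers; the only library calls used are RobertaTokenizer.from_pretrained() and tokenizer.tokenize(). The leading space is an assumption that suits RoBERTa's byte-level BPE, where word-initial tokens carry the Ġ marker.

```python
def split_into_subwords(word, tokenizer):
    """Return the sub-word pieces that `tokenizer` produces for `word`."""
    # Prepend a space so the word is tokenized as it would be mid-sentence
    # (RoBERTa stores word-initial tokens with a leading 'Ġ').
    return tokenizer.tokenize(" " + word)


def is_known(word, tokenizer):
    """Treat a word as 'known' if it maps to exactly one vocabulary token."""
    return len(split_into_subwords(word, tokenizer)) == 1


def demo():
    # Imported here so the helpers above can be used with any object
    # that exposes a tokenize() method.
    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    for word in ["people", "WCFs", "SARs", "OSGi"]:
        pieces = split_into_subwords(word, tokenizer)
        print(word, pieces, is_known(word, tokenizer))
```

Calling demo() prints the pieces for each word. A word that comes back as several pieces is not in the vocabulary as a whole, which may explain the weaker similarity scores, although the model can still build a representation from the pieces.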