Reputation: 2401
I'm trying to load a Hugging Face model and tokenizer. This normally works really easily (I've done it with a dozen models):
from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = BertForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
But for some reason I'm getting an error when trying to load this one:
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=False)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab
I found this related question, but it seems like that was an issue in the git repo itself and not on Hugging Face. I checked the actual repo where this model is hosted on Hugging Face (link) and it clearly has a vocab file (PubMD-30k-clean.vocab) like the rest of the models I loaded.
Upvotes: 0
Views: 3193
Reputation: 436
There seems to be some issue with the tokenizer. It works if you remove the use_fast parameter or set it to True; then you will be able to display the vocab.
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=True)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab
Output:
{'intervention': 7062,
'▁tongue': 6911,
'▁kit': 8341,
'▁biosimilar': 26423,
'bank': 19880,
'▁diesel': 20349,
'SOD': 6245,
'iri': 17739,
....
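As a side note, a minimal sketch (assuming a recent transformers version): the .vocab attribute is only guaranteed on the Rust-backed fast tokenizers, while get_vocab() is available on both fast and slow tokenizers, so it is a safer way to inspect the vocabulary:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=True)

print(tokenizer.is_fast)       # True -> the fast (Rust-backed) tokenizer was loaded
vocab = tokenizer.get_vocab()  # token -> id mapping; works for fast and slow tokenizers
print(len(vocab))              # vocabulary size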
Upvotes: 3