Penguin

Reputation: 2401

Loading a tokenizer on huggingface: AttributeError: 'AlbertTokenizer' object has no attribute 'vocab'

I'm trying to load a huggingface model and tokenizer. This normally works really easily (I've done it with a dozen models):

from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = BertForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

But for some reason I'm getting an error when I'm trying to load this one:

tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=False)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab

I found this related question, but it seems that was an issue in the git repo itself, not on huggingface. I checked the actual repo where this model is hosted on huggingface (link), and it clearly has a vocab file (PubMD-30k-clean.vocab), just like the rest of the models I've loaded.

Upvotes: 0

Views: 3193

Answers (1)

rbi

Reputation: 436

There seems to be an issue with the slow tokenizer. If you remove the use_fast parameter or set it to True, the fast tokenizer is used instead, and you will be able to display the vocab:

tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=True)
model = AlbertForMaskedLM.from_pretrained("sultan/BioM-ALBERT-xxlarge")
tokenizer.vocab

Output:

{'intervention': 7062,
 '▁tongue': 6911,
 '▁kit': 8341,
 '▁biosimilar': 26423,
 'bank': 19880,
 '▁diesel': 20349,
 'SOD': 6245,
 'iri': 17739,
....
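If you do need the slow (SentencePiece-based) tokenizer for some reason, note that it typically lacks a `.vocab` attribute but still exposes the `get_vocab()` method, which works on both slow and fast tokenizers. A minimal sketch (assuming `transformers` and `sentencepiece` are installed):

```python
from transformers import AutoTokenizer

# Slow tokenizer: no `.vocab` attribute, but `get_vocab()` still works.
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ALBERT-xxlarge", use_fast=False)

vocab = tokenizer.get_vocab()  # dict mapping token string -> integer id
print(len(vocab))
```

Using `get_vocab()` throughout is the portable choice, since it does not depend on whether the fast or slow tokenizer backend was loaded.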

Upvotes: 3
