mer

Reputation: 33

How to re-download tokenizer for huggingface?

I have the exact same problem as https://github.com/huggingface/transformers/issues/11243, except that at first it only failed in JupyterLab; it still worked in Python in my shell. EDIT: It is now not working in the shell either, after I closed and reopened it.

I downloaded the cardiffnlp/twitter-roberta-base-emotion model using:

model_name = "cardiffnlp/twitter-roberta-base-emotion"
model = AutoModelForSequenceClassification.from_pretrained(model_name)

I saved the model with model.save_pretrained(model_name) and now I can't load the tokenizer. If I run:

tokenizer = AutoTokenizer.from_pretrained(model_name)

it gives the error:

OSError: Can't load tokenizer for 'cardiffnlp/twitter-roberta-base-emotion'. Make sure that:

- 'cardiffnlp/twitter-roberta-base-emotion' is a correct model identifier listed on 'https://huggingface.co/models'
(make sure 'cardiffnlp/twitter-roberta-base-emotion' is not a path to a local directory with something else, in that case)

- or 'cardiffnlp/twitter-roberta-base-emotion' is the correct path to a directory containing relevant tokenizer files

Because I saved the model and not the tokenizer yesterday, I can't load the tokenizer anymore. What can I do to fix this? I don't understand how to save the tokenizer if I can't load the tokenizer.

Upvotes: 0

Views: 2783

Answers (1)

Prayson W. Daniel

Reputation: 15588

The model and tokenizer are two different things, yet they share the same download location. What happened here is that model.save_pretrained(model_name) created a local directory named cardiffnlp/twitter-roberta-base-emotion containing only the model files, and from_pretrained resolves an existing local path before falling back to the Hub, so the tokenizer can no longer be found. The quick fix is to delete (or rename) that stale local directory so the name resolves to the Hub again, and then save both pieces. A minimal sketch of that quick fix:
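from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cardiffnlp/twitter-roberta-base-emotion"

# With the stale local directory removed, the identifier resolves to the
# Hub again, so both downloads succeed.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Save BOTH artifacts this time, so the local copy is complete.
tokenizer.save_pretrained(model_name)
model.save_pretrained(model_name)

To make this repeatable, I wrote a simple utility that saves both the tokenizer and the model, and downloads only when no local copy exists: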

import typing as t
from loguru import logger
from pathlib import Path
import torch
from transformers import PreTrainedModel
from transformers import PreTrainedTokenizer


class ModelLoader:
    """ModelLoader
    Downloading and Loading Hugging FaceModels
       Download occurs only when model is not located in the local model directory
       If model exists in local directory, load.
    """

    def __init__(
        self,
        model_name: str,
        model_directory: str,
        tokenizer_loader: t.Type[PreTrainedTokenizer],
        model_loader: t.Type[PreTrainedModel],
    ):

        self.model_name = Path(model_name)
        self.model_directory = Path(model_directory)
        self.model_loader = model_loader
        self.tokenizer_loader = tokenizer_loader

        self.save_path = self.model_directory / self.model_name

        if not self.save_path.exists():
            logger.debug(f"[+] {self.save_path} does not exit!")
            self.save_path.mkdir(parents=True, exist_ok=True)
            self.__download_model()

        self.tokenizer, self.model = self.__load_model()

    def __repr__(self):
        return f"{self.__class__.__name__}(model={self.save_path})"

    # Download model from HuggingFace
    def __download_model(self) -> None:

        logger.debug(f"[+] Downloading {self.model_name}")
        tokenizer = self.tokenizer_loader.from_pretrained(f"{self.model_name}")
        model = self.model_loader.from_pretrained(f"{self.model_name}")

        logger.debug(f"[+] Saving {self.model_name} to {self.save_path}")
        tokenizer.save_pretrained(f"{self.save_path}")
        model.save_pretrained(f"{self.save_path}")

        logger.debug("[+] Process completed")

    # Load model
    def __load_model(self) -> t.Tuple:

        logger.debug(f"[+] Loading model from {self.save_path}")
        tokenizer = self.tokenizer_loader.from_pretrained(f"{self.save_path}")
        # Check if a GPU is available and move the model onto it
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = self.model_loader.from_pretrained(f"{self.save_path}").to(device)
        logger.info(f"[+] Model loaded on {device}")

        logger.debug("[+] Loading completed")
        return tokenizer, model

    def retrieve(self) -> t.Tuple:
        """Retrieve the loaded tokenizer and model.

        Returns:
            Tuple: (tokenizer, model)
        """
        return self.tokenizer, self.model

You can use it like this:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cardiffnlp/twitter-roberta-base-emotion"
model_directory = "/tmp"  # or wherever you want to store models

tokenizer_loader = AutoTokenizer
model_loader = AutoModelForSequenceClassification

get_model = ModelLoader(
    model_name=model_name,
    model_directory=model_directory,
    tokenizer_loader=tokenizer_loader,
    model_loader=model_loader,
)

# Note: retrieve() returns (tokenizer, model), in that order
tokenizer, model = get_model.retrieve()
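As a quick sanity check (this snippet is illustrative and not part of the utility itself; the label names come from the model's own config), you can run one example through the reloaded pair:

import torch

text = "I love this!"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to its label name
predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])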

Upvotes: 0
