jbm

Reputation: 1318

Using Huggingface Transformers and Tokenizers with a fixed vocabulary?

I have a special, non-language use case with a fixed vocabulary, i.e., a relatively small set of generated tokens that represents the entire vocabulary of our "language." I'd like to be able to use this with any of the different models, and I'm wondering what the best approach would be. The vocabulary is just a vocab.txt file of short strings, which I don't think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to "force" a vocabulary onto any of the tokenizers?


To clarify, our "language" uses prefixes to identify certain types of tokens, which have certain functions in the overall syntax. We want to be able to mask by type during inference, both on input and as part of the selection process, for example by limiting top-k or top-p sampling to a given type. With a fixed/hand-tuned vocabulary we can be very specific about which ids, or how many ids, we need; i.e., we know which tokens are used by each type, so we can mask/filter accordingly. However, with BPE tokenization a given type may be tokenized into any number of tokens, making this process much less straightforward.
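
To make the masking idea concrete: since we know exactly which ids belong to each type, restricting sampling to one type is just a matter of suppressing every other logit before top-k/top-p. A rough sketch (type_token_ids is a hypothetical, precomputed list of ids for one type):

import torch

# hypothetical: the ids belonging to one of our token "types", known ahead of time
type_token_ids = [17, 18, 19, 42]

def mask_logits_to_type(logits: torch.Tensor, allowed_ids) -> torch.Tensor:
    # everything outside the allowed ids gets -inf, so top-k/top-p can never select it
    mask = torch.full_like(logits, float('-inf'))
    mask[..., allowed_ids] = 0.0
    return logits + mask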

The motivation is just to make life easier by fitting into the Huggingface universe a little better, so we can experiment with off-the-shelf models more fluently. We already have this working using the standard BertTokenizer with both GPT2 and RoBERTa, but it would be nice to be able to experiment with different Huggingface models "out of the box," so to speak (using Trainers, Pipelines, etc.). With the BertTokenizer we just load our vocab.txt and we're done, so I wondered whether there would be some way of doing this with the other tokenizers (really, the BPE ones are the only issue, at this point).
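
For reference, the BertTokenizer setup amounts to something like this (a sketch; 'vocab.txt' stands in for our actual file and is assumed to contain BertTokenizer's default special tokens):

from transformers import BertTokenizer, GPT2Config, GPT2LMHeadModel

# load the fixed vocabulary directly; vocab.txt has one token per line
tokenizer = BertTokenizer(vocab_file='vocab.txt')

# size the model's embeddings to match the fixed vocab
config = GPT2Config(vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)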

It seems to me that being able to specify a vocab for any tokenizer would be more straightforward than getting our tokenizer working with other models. Though perhaps a better approach would be to look at streamlining that process? I suppose I could fork and modify AutoTokenizer... ??

Any help much appreciated.

Upvotes: 1

Views: 2362

Answers (2)

jbm

Reputation: 1318

I hadn't looked into this problem since not long after I posted (I gave up!), so I hadn't seen the answer from @skyzip. That one is basically correct, except for a couple of small tweaks. The version below seems to work, in case anybody needs it:

import json
from pathlib import Path
from typing import Optional, Tuple, Dict, Union

from transformers import PreTrainedTokenizer


class FixedVocabTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab: Union[Dict[str, int], str], max_len: int = None):
        if isinstance(vocab, str):
            vocab_path = Path(vocab)
            with open(vocab_path, 'r') as f:
                self._token_ids = json.load(f)
        else:
            self._token_ids = vocab
            
        self._id_tokens: Dict[int, str] = {value: key for key, value in self._token_ids.items()}
        super().__init__(max_len=max_len)

        # Initialize special tokens for RoBERTa
        self.unk_token = '<unk>'
        self.pad_token = '<pad>'
        self.cls_token = '<s>'
        self.sep_token = '</s>'
        self.mask_token = '<mask>'
        self.unk_token_id = self._token_ids.get(self.unk_token, 0)
        self.pad_token_id = self._token_ids.get(self.pad_token, 1)
        self.cls_token_id = self._token_ids.get(self.cls_token, 2)
        self.sep_token_id = self._token_ids.get(self.sep_token, 3)
        self.mask_token_id = self._token_ids.get(self.mask_token, 4)

    def _tokenize(self, text: str, **kwargs):
        return text.split(' ')

    def _convert_token_to_id(self, token: str) -> int:
        return self._token_ids[token] if token in self._token_ids else self.unk_token_id

    def _convert_id_to_token(self, index: int) -> str:
        return self._id_tokens[index] if index in self._id_tokens else self.unk_token

    def get_vocab(self) -> Dict[str, int]:
        return self._token_ids.copy()

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if filename_prefix is None:
            filename_prefix = ''
        vocab_path = Path(save_directory, filename_prefix + 'vocab.json')
        with open(vocab_path, 'w') as f:
            json.dump(self._token_ids, f)
        return (str(vocab_path),)

    @property
    def vocab_size(self) -> int:
        return len(self._token_ids)


if __name__ == '__main__':
    # your custom, fixed vocabulary
    custom_vocab = {
        '<unk>': 0,
        'word0': 1,
        'word1': 2,
        'word2': 3,
        'word3': 4,
        'word4': 5,
        'word5': 6,
        '<s>': 7,
        '</s>': 8,
        '<pad>': 9,
        '<mask>': 10
    }
    model_max_len = 8
    
    # you can construct the tokenizer either from the custom vocab dictionary
    # or from a path to a vocab file, e.g. FixedVocabTokenizer('path/to/vocab.json', max_len=model_max_len)
    tokenizer = FixedVocabTokenizer(custom_vocab, max_len=model_max_len)
    
    res = tokenizer(
        [
            'word1 word2 word word1 word3',
            'word2 word0 word0 word3 word5 word4 word2 word1 word0'
        ],
        padding=True,
        truncation=True
    )
    # the result should look something like this
    # res -> BatchEncoding(
    #     data: {
    #         'input_ids': [[2, 3, 0, 2, 4, 9, 9, 9], [3, 1, 1, 4, 6, 5, 3, 2]],
    #         'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]],
    #         ...
    #     },
    #     ...
    # )

    print(res)

I added support for a vocab.json file as well, since using a fixed vocab implies that you've had some process to generate one. I also made it RoBERTa-like in terms of special tokens, just for my current purposes.
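
For example, round-tripping the vocabulary looks something like this (a quick sketch; 'my_tokenizer' is just a placeholder directory that must already exist, and custom_vocab / model_max_len are as in the snippet above):

tokenizer = FixedVocabTokenizer(custom_vocab, max_len=model_max_len)
# save_vocabulary writes my_tokenizer/vocab.json; the returned path can be fed straight back in
saved_path, = tokenizer.save_vocabulary('my_tokenizer')
reloaded = FixedVocabTokenizer(saved_path, max_len=model_max_len)
assert reloaded.get_vocab() == tokenizer.get_vocab()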

Upvotes: 0

skyzip

Reputation: 255

As far as I understand, the solution below might help you, as you can use this tokenizer like you would the other pre-trained ones.

As I do not really understand all the inner workings of the tokenizer, I may very well be off with this solution, but hopefully it can help someone.

The main idea is to subclass PreTrainedTokenizer. This way, you only have to override some of the key methods like _tokenize, _convert_token_to_id, etc., which is more straightforward than implementing a whole new tokenizer.

import json
from pathlib import Path
from typing import Optional, Tuple, Dict

from transformers import PreTrainedTokenizer


class FixedVocabTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab: Dict[str, int], max_len: int = None):
        super().__init__(max_len=max_len)
        self.__token_ids = vocab
        self.__id_tokens: Dict[int, str] = {value: key for key, value in vocab.items()}

    def _tokenize(self, text: str, **kwargs):
        return text.split(' ')

    def _convert_token_to_id(self, token: str) -> int:
        return self.__token_ids[token] if token in self.__token_ids else self.unk_token_id

    def _convert_id_to_token(self, index: int) -> str:
        return self.__id_tokens[index] if index in self.__id_tokens else self.unk_token

    def get_vocab(self) -> Dict[str, int]:
        return self.__token_ids.copy()

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if filename_prefix is None:
            filename_prefix = ''
        vocab_path = Path(save_directory, filename_prefix + 'vocab.json')
        with open(vocab_path, 'w') as f:
            json.dump(self.__token_ids, f)
        return (str(vocab_path),)

    @property
    def vocab_size(self) -> int:
        return len(self.__token_ids)


if __name__ == '__main__':
    # your custom, fixed vocabulary
    custom_vocab = {
        '[UNK]': 0,
        'word0': 1,
        'word1': 2,
        'word2': 3,
        'word3': 4,
        'word4': 5,
        'word5': 6,
        '[CLS]': 7,
        '[SEP]': 8,
        '[PAD]': 9
    }
    model_max_len = 8
    tokenizer = FixedVocabTokenizer(custom_vocab, max_len=model_max_len)
    # tell your tokenizer about your special tokens
    tokenizer.add_special_tokens({
        'unk_token': '[UNK]',
        'pad_token': '[PAD]',
        'cls_token': '[CLS]',
        'sep_token': '[SEP]'
    })

    res = tokenizer(
        [
            'word1 word2 word word1 word3',
            'word2 word0 word0 word3 word5 word4 word2 word1 word0'
        ],
        padding=True,
        truncation=True
    )
    # the result should look something like this
    # res -> BatchEncoding(
    #     data: {
    #         'input_ids': [[2, 3, 0, 2, 4, 9, 9, 9], [3, 1, 1, 4, 6, 5, 3, 2]],
    #         'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]],
    #         ...
    #     },
    #     ...
    # )

This is the solution I could come up with; however, I could not figure out whether you could do something similar with PreTrainedTokenizerFast. So one more note: you can only use slow tokenizers with this method.
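
One direction that might work for the fast case, though I have not verified it, is to skip the subclass entirely: build a word-level tokenizer with the tokenizers library from the same fixed vocab and wrap it in PreTrainedTokenizerFast. Treat this as a sketch only (reusing custom_vocab and model_max_len from above):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

# a word-level tokenizer built straight from the fixed vocab
word_level = Tokenizer(WordLevel(vocab=custom_vocab, unk_token='[UNK]'))
word_level.pre_tokenizer = WhitespaceSplit()  # split on whitespace only, like _tokenize above

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=word_level,
    model_max_length=model_max_len,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]'
)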

Upvotes: 1
