Reputation: 1936
I have a very simple tokenizer like this:
%%time
from tokenizers import Tokenizer, models, trainers
from transformers import PreTrainedTokenizerFast

# seqs, vocab_size and max_length are defined earlier
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["<pad>", "<unk>"],
                              min_frequency=1500, show_progress=True)
tokenizer.train_from_iterator(seqs, trainer=trainer)

# Wrap in PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    model_max_length=max_length,  # max_length is silently ignored here; model_max_length is the supported kwarg
    truncation_side="left",
    padding_side="right",
)
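For completeness, this is roughly how I then encode sequences with the wrapped tokenizer (illustrative call only; seqs and max_length are the same variables as above):

# Illustrative batch encode: right padding and left truncation as configured above
enc = wrapped_tokenizer(
    list(seqs),
    truncation=True,
    padding="max_length",
    max_length=max_length,
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # (num_sequences, max_length)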
And I want to train a custom Mistral model with modified parameters like this:
from transformers import MistralConfig, MistralForCausalLM

custom_config = MistralConfig(
    vocab_size=160000,
    hidden_size=768,
    intermediate_size=2048,
    num_attention_heads=24,
    num_hidden_layers=24,
    max_position_embeddings=2048,
    sliding_window=1024,
    rms_norm_eps=1e-5,
    use_cache=False,
    # the raw Tokenizer has no .vocab attribute; use token_to_id (or wrapped_tokenizer.vocab)
    pad_token_id=tokenizer.token_to_id("<pad>"),
    unk_token_id=tokenizer.token_to_id("<unk>"),
    bos_token_id=None,
    eos_token_id=None,
)
model = MistralForCausalLM(custom_config)
print(model)
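As a sanity check on the freshly initialized model I run something like this (toy batch of random ids, just to confirm the shapes and parameter count look reasonable):

import torch

dummy_ids = torch.randint(0, custom_config.vocab_size, (2, 16))  # batch of 2, sequence length 16
with torch.no_grad():
    out = model(input_ids=dummy_ids)
print(out.logits.shape)                            # expected: (2, 16, 160000)
print(sum(p.numel() for p in model.parameters()))  # total parameter count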
I read the sliding window attention paper, but I'm still not entirely clear on how sliding window attention works. If I didn't do any chunking during tokenization, i.e. nothing like return_overflowing_tokens=True in the tokenizer, and I use sliding window attention, what happens?
In the first transformer layer, does the sliding window stride by 1 each time and end up producing max_position_embeddings - 1 feature maps, or something else? I don't see a stride parameter, so I'm not sure how to interpret this.
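To make the question concrete, this is how I currently picture the attention mask (a toy sketch I wrote for my own understanding, not Mistral's actual code; window_size stands in for sliding_window):

import torch

def causal_sliding_window_mask(seq_len: int, window_size: int) -> torch.Tensor:
    # True means query position i may attend to key position j
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # j <= i
    offsets = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)  # i - j
    return causal & (offsets < window_size)  # additionally require i - j < window_size

print(causal_sliding_window_mask(6, 3).int())

My reading is that every position still gets a hidden state in every layer; the window only limits how far back each position can attend directly, so there is no explicit stride. Is that the right way to think about it?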
Upvotes: 1
Views: 47