Jonathan

Reputation: 1936

How does sliding window attention work for Mistral7B model without chunking?

I have a very simple tokenizer like this:

%%time
from tokenizers import Tokenizer, models, trainers
from transformers import PreTrainedTokenizerFast

# Train a byte-fallback BPE tokenizer on the raw sequences
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["<pad>", "<unk>"],
                              min_frequency=1500, show_progress=True)

tokenizer.train_from_iterator(seqs, trainer=trainer)

# Wrap in PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="<unk>",
    pad_token="<pad>",
    model_max_length=max_length,  # the init kwarg is model_max_length, not max_length
    truncation_side="left",
    padding_side="right",
)

And I want to train a custom mistral model with modified parameters like this:

from transformers import MistralConfig, MistralForCausalLM

custom_config = MistralConfig(
    vocab_size=160000,
    hidden_size=768,
    intermediate_size=2048,
    num_attention_heads=24,
    num_hidden_layers=24,
    max_position_embeddings=2048,
    sliding_window=1024,
    rms_norm_eps=1e-5,
    use_cache=False,
    pad_token_id=tokenizer.token_to_id("<pad>"),  # the raw Tokenizer has no .vocab attribute
    unk_token_id=tokenizer.token_to_id("<unk>"),
    bos_token_id=None,
    eos_token_id=None,
)
model = MistralForCausalLM(custom_config)
print(model)

I read the sliding window attention paper, but I'm still not entirely clear on how sliding window attention works. If I didn't do any chunking during tokenization (i.e. something like return_overflowing_tokens=True in the tokenizer call) and I use sliding window attention, what happens?
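To be concrete about what I mean by "chunking" vs. "no chunking" (long_text here is just a placeholder, not from my actual pipeline):

# Hypothetical example text; in my real pipeline the inputs come from `seqs`.
long_text = "..."

# No chunking: the text becomes a single example, truncated to max_length tokens.
enc = wrapped_tokenizer(long_text, truncation=True, max_length=max_length)
print(len(enc["input_ids"]))  # token count of the one (truncated) sequence

# Chunking: overflowing tokens come back as additional examples.
enc_chunks = wrapped_tokenizer(
    long_text,
    truncation=True,
    max_length=max_length,
    return_overflowing_tokens=True,
)
print(len(enc_chunks["input_ids"]))  # number of max_length-sized chunks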

In the first transformer layer, does the sliding window attention stride by 1 every time and end up creating max_position_embeddings - 1 feature maps, or something else? I don't see a stride parameter, so I'm not sure how to interpret this.
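Here is a minimal sketch of my current understanding (my own reading, not Mistral's actual implementation): each position attends to itself and at most the previous sliding_window positions (up to the exact off-by-one convention), so the window effectively moves forward by 1 per token rather than by a configurable stride. The toy sizes below stand in for sliding_window=1024.

import torch

seq_len = 8          # toy sequence length
sliding_window = 4   # toy window

i = torch.arange(seq_len).unsqueeze(1)   # query positions
j = torch.arange(seq_len).unsqueeze(0)   # key positions

causal = j <= i                       # cannot look ahead
in_window = (i - j) < sliding_window  # cannot look back further than the window
mask = causal & in_window

print(mask.int())
# Row i has ones in columns max(0, i - sliding_window + 1) .. i,
# i.e. the window shifts by exactly 1 with each query position.

Is this per-position, shift-by-1 window what actually happens inside each layer when the input is longer than sliding_window, or does something else go on?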

Upvotes: 1

Views: 47

Answers (0)
