
Reputation: 24911

While training BERT variant, getting IndexError: index out of range in self

While training XLMRobertaForSequenceClassification:

xlm_r_model(input_ids = X_train_batch_input_ids
            , attention_mask = X_train_batch_attention_mask
            , return_dict = False)

I get the following error:

Traceback (most recent call last):
  File "<string>", line 3, in <module>
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/", line 1218, in forward
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/", line 849, in forward
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/", line 132, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 160, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Below are the details:

  1. Creating model

    config = XLMRobertaConfig() 
    config.output_hidden_states = False
    xlm_r_model = XLMRobertaForSequenceClassification(config=config) # device is device(type='cpu')

  2. Tokenizer

    xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
    MAX_TWEET_LEN = 402
    >>> # describing a data frame I have pre populated
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000 entries, 29639 to 44633
    Data columns (total 2 columns):
    #    Column  Non-Null Count  Dtype 
    ---  ------  --------------  ----- 
    0    text    1000 non-null   object
    1    class   1000 non-null   int64 
    dtypes: int64(1), object(1)
    memory usage: 55.7+ KB
    X_train = xlmr_tokenizer(list(df_1000[:800].text), padding=True, max_length=MAX_TWEET_LEN+5, truncation=True) # +5: a head room for special tokens / separators
    >>> list(map(len,X_train['input_ids']))  # why is it 105? Shouldn't it be MAX_TWEET_LEN+5 = 407?
    [105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, ...]
    >>> type(train_index) # describing (for clarity) training fold indices I pre populated
    <class 'numpy.ndarray'>
    >>> train_index.size 
    X_train_fold_input_ids = np.array(X_train['input_ids'])[train_index]
    X_train_fold_attention_mask = np.array(X_train['attention_mask'])[train_index]
    >>> i # batch id
    >>> batch_size
    X_train_batch_input_ids = X_train_fold_input_ids[i:i+batch_size]
    X_train_batch_input_ids = torch.tensor(X_train_batch_input_ids,dtype=torch.long).to(device)
    X_train_batch_attention_mask = X_train_fold_attention_mask[i:i+batch_size]
    X_train_batch_attention_mask = torch.tensor(X_train_batch_attention_mask,dtype=torch.long).to(device)
    >>> X_train_batch_input_ids.size()
    torch.Size([16, 105]) # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?
    >>> X_train_batch_attention_mask.size()
    torch.Size([16, 105]) # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?

After this, I call xlm_r_model(...) as shown at the beginning of this question and end up with the error above.

Even with all these details, I cannot work out why I am getting this error. What am I doing wrong?

Upvotes: 0

Views: 7796

Answers (2)


Reputation: 11

I had the same issue and solved it by replacing the local model path with the Hugging Face model name (from "/path/to/local/model" to "bert-base-chinese").

Upvotes: 0


Reputation: 24911

As per this post on GitHub, there can be several reasons for this error. Below is a list of reasons summarised from that post (as of April 24, 2022; note that the 2nd and 3rd reasons are not tested):

  1. The vocabulary size of the tokenizer does not match that of the BERT model, so the tokenizer generates IDs that fall outside the model's embedding table. ref
  2. The model and the data are on different devices (CPUs, GPUs, TPUs). ref
  3. Sequences are longer than 512 tokens (the maximum for BERT-like models). ref
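
To narrow down which reason applies, it can help to inspect a batch before it reaches the model. Below is a minimal sketch in plain Python that checks for reasons 1 and 3. The helper names find_bad_token_ids and find_long_sequences and the parameters vocab_size and max_positions are my own, not part of transformers. I am assuming you pass the batch as nested lists (for a tensor, something like X_train_batch_input_ids.tolist()) and take the limits from your model's config.

```python
from typing import List, Sequence, Tuple


def find_bad_token_ids(
    batch: Sequence[Sequence[int]], vocab_size: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, token_id) for every id outside [0, vocab_size)."""
    return [
        (r, c, tok)
        for r, row in enumerate(batch)
        for c, tok in enumerate(row)
        if not 0 <= tok < vocab_size
    ]


def find_long_sequences(
    batch: Sequence[Sequence[int]], max_positions: int
) -> List[Tuple[int, int]]:
    """Return (row, length) for every sequence longer than max_positions."""
    return [(r, len(row)) for r, row in enumerate(batch) if len(row) > max_positions]


# Toy batch with made-up ids; vocab_size=10 and max_positions=3 are
# illustrative values only, not real model settings.
toy_batch = [[0, 5, 9, 2], [0, 12, 3, 2]]
bad_ids = find_bad_token_ids(toy_batch, vocab_size=10)
too_long = find_long_sequences(toy_batch, max_positions=3)
print(bad_ids)   # ids that an embedding table of size 10 cannot look up
print(too_long)  # rows exceeding the chosen position limit
```

If the first list is non-empty, the tokenizer is producing ids the embedding layer cannot index, which points to reason 1. If only the second list is non-empty, reason 3 is the more likely cause.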

In my case it was the first reason, a mismatched vocabulary size. Here is how I fixed it:

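from transformers import XLMRobertaConfig, XLMRobertaForSequenceClassification, XLMRobertaTokenizer

xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
config = XLMRobertaConfig()
config.vocab_size = xlmr_tokenizer.vocab_size  # make the model's embedding table match the tokenizer's vocabulary
xlm_r_model = XLMRobertaForSequenceClassification(config=config)  # build the model from the adjusted config, as in the question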

Upvotes: 4
