Stackuser1908

Reputation: 31

Error: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2])) while training a binary classifier with deepset/gbert-base

I am aware of most of the solutions previously discussed here for this problem, but I still had no luck with them.

I’m trying to implement a binary classifier. I’m using a custom dataset with one text column containing German text; the label column has two classes, 0 and 1.

I’m using the deepset/gbert-base model with the number of labels set to 2. I followed the official Hugging Face tutorial at https://huggingface.co/learn/nlp-course/chapter3/4 and everything matches until this step:

outputs = model(**batch)

I have tried the following workarounds suggested on this forum and other coding forums:

  1. I checked the PyTorch version (online forums suggested updating any version below 2); I’m using 2.0.0+cu118.

  2. The labels are of float type and do not contain any null values (online forums suggested checking that the label dtype is float, since the model supposedly expects that format).

  3. I also tried changing the label shape from [0] and [1] to [1, 0] for class 0 and [0, 1] for class 1, since the error says the model’s input to the loss function has size [8, 2] while the targets (the labels) have size [8]. This reshaping did not solve the problem either (see the sketches after this list).

  4. I also tried the Trainer API, following the official Hugging Face tutorial at https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt, and tried swapping the loss function from binary_cross_entropy_with_logits to nn.CrossEntropyLoss() (also sketched after this list), just to see if the code would run, but I ended up with the same error.

  5. I also tried models other than the one mentioned above:

nlptown/bert-base-multilingual-uncased-sentiment
papluca/xlm-roberta-base-language-detection
oliverguhr/german-sentiment-bert

But I get the same error.
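
For reference, the reshaping from point 3 and the loss swap from point 4 looked roughly like the following sketches (reconstructed, not my exact code; they assume the labels live in a pandas DataFrame column df["labels"]):

import torch
import torch.nn as nn
from transformers import Trainer

# point 3 (sketch): turn scalar labels 0/1 into one-hot float targets of shape [N, 2]
labels = torch.tensor(df["labels"].astype(int).values)
one_hot_labels = torch.nn.functional.one_hot(labels, num_classes=2).float()

# point 4 (sketch): a Trainer subclass that swaps in nn.CrossEntropyLoss
class CrossEntropyTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = nn.CrossEntropyLoss()(outputs.logits, labels.long())
        return (loss, outputs) if return_outputs else loss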

Code:

from transformers import AutoTokenizer, DataCollatorWithPadding
tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")

def tokenize_function(examples):
    return tokenizer(examples["text1"], truncation=True)

tokenized_datasets = final_dataset_dict.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)
tokenized_datasets = tokenized_datasets.remove_columns(["text1"])  # drop the raw text column
tokenized_datasets["train"].column_names  # sanity check of remaining columns
tokenized_datasets.set_format("torch")

from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)

# grab one batch to inspect its shapes
for batch in train_dataloader:
    break
print({k: v.shape for k, v in batch.items()})

from transformers import AutoModelForSequenceClassification
checkpoint = "deepset/gbert-base"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

outputs = model(**batch)  # <- fails here
print(outputs.loss, outputs.logits.shape)

After tokenization my data looks like this:

DatasetDict({
    train: Dataset({
        features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2512
    })
    test: Dataset({
        features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1255
    })
    validation: Dataset({
        features: ['text1', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1255
    })
})

The batch items in the train_dataloader look like this:

{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 69]), 'token_type_ids': torch.Size([8, 69]), 'attention_mask': torch.Size([8, 69])}

The detailed error is as follows:

 ---------------------------------------------------------------------------
 ValueError                                Traceback (most recent call last)
 <ipython-input-36-b84c8f6552ab> in <cell line: 1>()
 ----> 1 outputs = model(**batch)
       2 #print(outputs.shape)
       3 print(outputs.loss, outputs.logits.shape)
 
 4 frames
 /usr/local/lib/python3.9/dist-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
    3161 
    3162     if not (target.size() == input.size()):
 -> 3163         raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
    3164 
    3165     return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
 
 ValueError: Target size (torch.Size([8])) must be the same as input size (torch.Size([8, 2]))
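
Since the traceback ends in binary_cross_entropy_with_logits, the same mismatch can be reproduced in isolation, independent of the model (a minimal sketch):

import torch

logits = torch.randn(8, 2)  # what the model passes to the loss: [batch, num_labels]
targets = torch.zeros(8)    # float targets of shape [batch], like my labels
# raises the same ValueError about target size vs. input size
torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)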

Any lead on this problem would be very much appreciated.

I expect output like in the tutorial: the printed loss tensor and the logits shape.

Upvotes: 1

Views: 1351

Answers (1)

Stackuser1908

Reputation: 31

Changing the label datatype to integer solved the problem.

df['labels'] = df['labels'].astype(int)
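
For context on why this works: when config.problem_type is not set, the transformers sequence-classification head infers the loss from num_labels and the label dtype. With num_labels > 1 and float labels it assumes multi_label_classification and uses BCEWithLogitsLoss, which expects targets of shape [batch, num_labels]; with integer (long/int) labels it picks single_label_classification and uses CrossEntropyLoss, which expects targets of shape [batch]. A sketch of the equivalent fixes on the tokenized datasets or at model load time (assuming the label column is named "labels"):

from datasets import Value
from transformers import AutoModelForSequenceClassification

# cast the label column to int64 so the model infers single-label classification
tokenized_datasets = tokenized_datasets.cast_column("labels", Value("int64"))

# or make the problem type explicit instead of relying on dtype inference
model = AutoModelForSequenceClassification.from_pretrained(
    "deepset/gbert-base", num_labels=2, problem_type="single_label_classification"
)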

Upvotes: 2
