Reputation: 55
I am trying to fine-tune a Bart model from the huggingface transformers framework on a dialogue summarisation task. By default, Bart takes the conversation as one monolithic piece of text as the encoder input and takes the summary as the decoder input while training. I want to explicitly train the model on dialogue speaker and utterance information rather than waiting for the model to learn it implicitly. For this reason, I extract the positions of the speaker-name tokens and their utterance tokens and send them to the model separately, alongside the original input tokens and summary tokens. However, the data collator's padding automation expects this extra information to be the same size as the inputs, so I need to either disable this behaviour or change the way I encode the speaker-to-utterance mapping.
Please find the code and a description of the issue below. I am using the SAMSum dataset for the dialogue summarisation task. The dataset looks like this:
Conversation:
Amanda: I baked cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)
Summary:
Amanda baked cookies and will bring Jerry some tomorrow.
The conversation gets tokenized as:
tokens = [0, 10127, 5219, 35, 38, 17241, 1437, 15269, 4, 1832, 47, 236, 103, 116, 50121, 50118, 39237, 35, 9136, 328, 50121, 50118, 10127, 5219, 35, 38, 581, 836, 47, 3859, 48433, 2]
The explicit speaker-utterance information is encoded as:
[0, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0]
where the 1s indicate that tokens[1:3] map to the name "Amanda" and the 2s indicate that tokens[3:16] map to the utterance ": I baked cookies. Do you want some?"
I am trying to send this speaker-utterance association information to the forward function so that I can add a loss term based on it. I intend to override the compute_loss method of the Trainer class from the huggingface framework to adjust the loss once I can successfully relay this explicit information.
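Roughly what I have in mind for that override is the sketch below (just an illustration; the spk_utt_loss helper mentioned in the comment is my own code and does not exist yet):

from transformers import Seq2SeqTrainer

class SpkUttTrainer(Seq2SeqTrainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the extra field so model.forward() is not called with an
        # unexpected keyword argument.
        spk_utt_pos = inputs.pop("spk_utt_pos")
        outputs = model(**inputs)
        loss = outputs.loss
        # The auxiliary term based on spk_utt_pos would be added here, e.g.
        # loss = loss + spk_utt_loss(outputs, spk_utt_pos)
        return (loss, outputs) if return_outputs else loss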
I am currently trying the following:
tokenized_dataset_train = train_datasets.map(preprocess_function, batched=True)
where preprocess_function tokenizes the examples and adds the speaker-utterance information as an extra key-value pair, so that tokenized_dataset_train is of the form {'input_ids': [...], 'attention_mask': [...], 'spk_utt_pos': [...], ...}.
The preprocess function makes sure that, for each example, 'input_ids', 'attention_mask', and 'spk_utt_pos' all have the same length.
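For reference, preprocess_function is roughly of this shape (simplified; tokenizer is the Bart tokenizer loaded earlier, the max_length values are arbitrary, and build_spk_utt_pos stands for my own code that produces the 0/1/2 mapping shown above, one entry per input token):

def preprocess_function(examples):
    # Tokenize the dialogues (encoder input) and the summaries (labels).
    model_inputs = tokenizer(examples["dialogue"], max_length=512, truncation=True)
    labels = tokenizer(examples["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    # One speaker-utterance tag per input token, same length as input_ids.
    model_inputs["spk_utt_pos"] = [
        build_spk_utt_pos(dialogue, ids)
        for dialogue, ids in zip(examples["dialogue"], model_inputs["input_ids"])
    ]
    return model_inputs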
The data_collator from DataCollatorForSeq2Seq pads 'input_ids' and 'attention_mask', but it does not pad 'spk_utt_pos', which leads to the following error when the batch is converted to tensors:
Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`spk_utt_pos` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Upon printing the sizes of 'input_ids', 'attention_mask', and 'spk_utt_pos' inside the train loop during the data collation step, I found that the sizes were not the same. Example (a 32-instance batch):
'input_ids' sizes 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357
'attention_mask' sizes 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357 357
'spk_utt_pos' sizes 285 276 276 321 58 93 77 69 198 266 55 107 85 235 47 280 209 357 86 186 27 52 80 77 85 231 266 237 322 125 251 126
My question is: Is there something wrong with my approach to adding this explicit information to my model? What would be another way to send the speaker-utterance information to my model?
Upvotes: 0
Views: 402
Reputation: 55
I solved this by extending the DataCollatorForSeq2Seq class and overriding its __call__ method to also pad my 'spk_utt_pos' lists appropriately.
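Something along these lines (a simplified sketch; it assumes right padding and a transformers version in which DataCollatorForSeq2Seq.__call__ accepts a return_tensors argument):

import torch
from transformers import DataCollatorForSeq2Seq

class DataCollatorForSeq2SeqWithSpkUtt(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        # Take the custom field out before the parent collator converts the
        # batch to tensors, otherwise the unequal lengths raise the error above.
        spk_utt_pos = [feature.pop("spk_utt_pos") for feature in features]
        batch = super().__call__(features, return_tensors=return_tensors)
        # Pad each spk_utt_pos list with 0s (the tag used for special tokens)
        # up to the padded input length, then add it back to the batch.
        max_len = batch["input_ids"].shape[1]
        padded = [seq + [0] * (max_len - len(seq)) for seq in spk_utt_pos]
        batch["spk_utt_pos"] = torch.tensor(padded, dtype=torch.long)
        return batch

It is then used in place of the stock collator, e.g. data_collator = DataCollatorForSeq2SeqWithSpkUtt(tokenizer, model=model).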
Upvotes: 0