Tazin Islam

Reputation: 51

How do I make a paraphrase generation using BERT/ GPT-2

I am trying hard to understand how to build a paraphrase-generation model using BERT/GPT-2, but I cannot figure out how to do it. Could you please point me to any resources that would help me build such a model? The input would be a sentence and the output would be a paraphrase of that sentence.

Upvotes: 5

Views: 4250

Answers (3)

PRIYANSHU KUMAR

Reputation: 1

Whatever paraphrase you get from the code above just comes out in a different word order: for example, the input text = "kohli a indian cricketer is" comes out after paraphrasing as "kohli is a indian cricketer". What I want is SOV (subject, verb, object) order.

Upvotes: -2

David Dale

Reputation: 11444

Here is my recipe for training a paraphraser:

  1. Instead of BERT (encoder only) or GPT (decoder only), use a seq2seq model with both an encoder and a decoder, such as T5, BART, or Pegasus. I suggest the multilingual T5 model, which was pretrained for 101 languages. If you want to load embeddings only for your own language (instead of all 101), you can follow this recipe.

  2. Find a corpus of paraphrases for your language and domain. For English, ParaNMT, PAWS, and QQP are good candidates. TaPaCo, a paraphrase corpus extracted from Tatoeba, covers 73 languages, so it is a good starting point if you cannot find a corpus for your language (a sketch of loading it and defining get_batch follows the training code below).

  3. Fine-tune your model on this corpus. The code can be something like this:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
# use here a backbone model of your choice, e.g. google/mt5-base
backbone_model = 'cointegrated/rut5-base-multitask' 
model = T5ForConditionalGeneration.from_pretrained(backbone_model)
tokenizer = T5Tokenizer.from_pretrained(backbone_model)
model.cuda()
optimizer = torch.optim.Adam(params=[p for p in model.parameters() if p.requires_grad], lr=1e-5)

# todo: load the paraphrasing corpus and define the get_batch function

for i in range(100500):
    xx, yy = get_batch()  # get_batch should return lists of source and target strings
    x = tokenizer(xx, return_tensors='pt', padding=True).to(model.device)
    y = tokenizer(yy, return_tensors='pt', padding=True).to(model.device)
    # do not force the model to predict pad tokens
    y.input_ids[y.input_ids==0] = -100
    loss = model(
        input_ids=x.input_ids,
        attention_mask=x.attention_mask,
        labels=y.input_ids,
        decoder_attention_mask=y.attention_mask,
        return_dict=True
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained('my_paraphraser')
tokenizer.save_pretrained('my_paraphraser')
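
The "todo" above is the part that depends on your data. Here is a minimal sketch of one way to fill it in, assuming the Hugging Face datasets library and the TaPaCo corpus mentioned in step 2; the 'tapaco' dataset id and its 'paraphrase' / 'paraphrase_set_id' columns are assumptions, so adapt them to whatever corpus you actually use:

import random
from datasets import load_dataset

# TaPaCo groups sentences into paraphrase sets; we sample source/target pairs within a set
data = load_dataset('tapaco', 'en', split='train')

groups = {}
for row in data:
    groups.setdefault(row['paraphrase_set_id'], []).append(row['paraphrase'])
paraphrase_sets = [sents for sents in groups.values() if len(sents) >= 2]

def get_batch(batch_size=16):
    """Sample batch_size (source, target) pairs of paraphrases."""
    xx, yy = [], []
    for _ in range(batch_size):
        src, tgt = random.sample(random.choice(paraphrase_sets), 2)
        xx.append(src)
        yy.append(tgt)
    return xx, yy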

A more complete version of this code can be found in this notebook.

After the training, the model can be used in the following way:

from transformers import pipeline
pipe = pipeline(task='text2text-generation', model='my_paraphraser')
print(pipe('Here is your text'))
# [{'generated_text': 'Here is the paraphrase or your text.'}]

If you want your paraphrases to be more diverse, you can control the generation process with arguments like these:

print(pipe(
    'Here is your text', 
    encoder_no_repeat_ngram_size=3,  # make output different from input
    do_sample=True,  # randomize
    num_beams=5,  # try more options
    max_length=128,  # longer texts
))

Enjoy!

Upvotes: 8

chandni magoo

Reputation: 21

You can use a T5 model fine-tuned for paraphrasing to generate paraphrases.
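
A minimal sketch of what this could look like, assuming a T5 checkpoint fine-tuned for paraphrasing with a "paraphrase: " task prefix (many public checkpoints follow this convention); the model name below is a placeholder, not a real model id:

from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'your-t5-paraphrase-checkpoint'  # placeholder: substitute a real fine-tuned model
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = 'paraphrase: The quick brown fox jumps over the lazy dog.'
inputs = tokenizer(text, return_tensors='pt')
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=3,  # return several candidate paraphrases
    max_length=64,
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))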

Upvotes: 1
