I have the following code. It uses the GPT-2 language model from the Transformers library to generate text from a given input. The input text is split into smaller chunks of 1024 tokens, and the GPT-2 model generates text for each chunk; the generated pieces are then concatenated to produce the final output. The HappyTransformer library simplifies generation by providing a pre-trained model and an interface for generating text from a prefix and some settings. The GPT-2 model and tokenizer are also saved to a local directory. The output is the generated text for the input, with grammar corrections suggested by the prefix "grammar: ".
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from happytransformer import HappyGeneration, GENSettings
import torch
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
save_path = "/home/ubuntu/storage1/various_transformer_models/gpt2"
# save the tokenizer and model to a local directory
tokenizer.save_pretrained(save_path)
model.save_pretrained(save_path)
# Processing
happy_gen = HappyGeneration("GPT-2", "gpt2")
args = GENSettings(num_beams=5, max_length=1024)
mytext = "This sentence has bad grammar. This is a very long sentence that exceeds the maximum length of 512 tokens. Therefore, we need to split it into smaller chunks and process each chunk separately."
prefix = "grammar: "
# Split the text into chunks of maximum length 1024 tokens
max_length = 1024
chunks = [mytext[i:i+max_length] for i in range(0, len(mytext), max_length)]
# Process each chunk separately
results = []
for chunk in chunks:
    # Generate outputs for each chunk
    result = happy_gen.generate_text(prefix + chunk, args=args)
    results.append(result.text)
# Concatenate the results
output_text = " ".join(results)
print(output_text)
But it gives me this error:
RuntimeError: The size of tensor a (1024) must match the size of tensor b (1025) at non-singleton dimension 3
How can I resolve it?
Let's break down the example code you have a little:
# Split the text into chunks of maximum length 1024 tokens
max_length = 1024
chunks = [mytext[i:i+max_length] for i in range(0, len(mytext), max_length)]
If the goal is to split mytext into chunks of 1024 characters, the list comprehension isn't actually doing any splitting here, because the example text is shorter than 1024 characters, so it yields a single chunk. More importantly, slicing characters doesn't bound the token count: a 1024-character chunk can encode to more tokens than GPT-2's 1024-position context window, and once the prefix and generated tokens push the sequence past that limit you get the size-mismatch RuntimeError you're seeing.

Most probably you want to split the text into tokens/subwords instead of using the raw character limit, i.e.
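To see what the character-based comprehension actually produces for a short string, here's a minimal sketch (plain Python, no model needed):

```python
# Character-based slicing: a string shorter than max_length
# comes back as a single chunk, so nothing is actually split.
mytext = "This sentence has bad grammar."
max_length = 1024
chunks = [mytext[i:i + max_length] for i in range(0, len(mytext), max_length)]
print(len(chunks))           # → 1 (the whole text fits in one chunk)
print(chunks[0] == mytext)   # → True
```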
input_ids = tokenizer(mytext)['input_ids']
max_length = 50
chunks_ids = [input_ids[i:i+max_length] for i in range(0, len(input_ids), max_length)]
chunks_str = [tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(c)) for c in chunks_ids]
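The slicing pattern over token ids can be checked without loading the tokenizer at all, using a stand-in list of ids (the numbers below are placeholders, not real GPT-2 ids):

```python
# Stand-in for tokenizer(mytext)['input_ids']; real ids would come
# from the GPT-2 tokenizer, these are just placeholders.
input_ids = list(range(120))  # pretend the text encodes to 120 tokens
max_length = 50
chunks_ids = [input_ids[i:i + max_length] for i in range(0, len(input_ids), max_length)]
print([len(c) for c in chunks_ids])  # → [50, 50, 20]; the last chunk is shorter
```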
Then you can generate as such:
from tqdm import tqdm
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from happytransformer import HappyGeneration, GENSettings
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
# Processing
happy_gen = HappyGeneration("GPT-2", "gpt2")
max_length = 50
args = GENSettings(num_beams=2, max_length=max_length)
mytext = "Golden Delicious is a cultivar of apple. The cultivar arose from a chance seedling, possibly a hybrid of Grimes Golden and Golden Reinette. The original tree was found on the family farm of J. M. Mullins in Clay County, West Virginia, and was locally known as Mullins' Yellow Seedling. Mullins sold the tree and propagation rights to Stark Brothers Nurseries and Orchards for $5000, which first marketed it as a companion of their Red Delicious in 1914 (although the two cultivars are not closely related). Golden Delicious is one of the most popular apple cultivars in the United States, popular for eating as well as in salads, apple sauces, and apple pies. Golden Delicious arose from a chance seedling, possibly a hybrid of Grimes Golden and Golden Reinette. The original tree was found on the family farm of J. M. Mullins in Clay County, West Virginia, and was locally known as Mullins' Yellow Seedling. Mullins sold the tree and propagation rights to Stark Brothers Nurseries for $5000, which first marketed it as a companion of their Red Delicious in 1914. In 1943, the New York State Agricultural Experiment Station in Geneva, New York developed the Jonagold apple by cross-breeding Golden Delicious and Jonathan trees. The cultivar was officially released in 1968 and went on to become the leading apple cultivar in Europe.[7] According to the US Apple Association website, as of 2008, Golden Delicious, along with its descendent cultivars Gala, Ginger Gold, Honeycrisp, and Jonagold, were among the fifteen most popular apple cultivars in the United States."
input_ids = tokenizer(mytext)['input_ids']
chunks = [input_ids[i:i+max_length] for i in range(0, len(input_ids), max_length)]
chunks_str = [tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(c)) for c in chunks]
results = [happy_gen.generate_text(f"grammar: {c}", args=args)
           for c in tqdm(chunks_str)]
for r in results:
    print(r.text)
IMPORTANT NOTE: Though the generation code works, I'm not sure what the "grammar: " prefix is trying to achieve or what the model is expected to output. It looks like you're trying to do some grammar correction, but do note that GPT-2's generate doesn't work as a grammar corrector out of the box!