Reputation: 4365
I'd like to send the text from various PDFs to OpenAI's API, specifically the "Summarize for a 2nd grader" or the "TL;DR" summarization APIs.
I can extract the text from PDFs using PyMuPDF and prepare the OpenAI prompt.
Question: How best to prepare the prompt when the token count is longer than the allowed 2049?
Upvotes: 25
Views: 50971
Reputation: 1075
To process PDF files in manageable chunks, follow this strategy. I’ll explain it using my use case, where I extract information from discharge summaries, clinical notes, lab reports, and similar documents.
Breaking the content into chunks while preserving context: use AWS Textract's text-detection API to extract text from multiple pages. This will give you a large text string that you can then process in chunks.
Indicators, context continuation markers, or summaries of previous chunks: to signal that there is more content to process, you can include indicators like the following in your prompts.
Here's a sample prompt structure:
"Process chunk X of discharge summary PDF file. Key pages are pages 1-3. Include text from pages 4-6 in this chunk. Context continuation marker: [previous chunk text]."
Example code using AWS Textract and S3
import boto3
import json
# Initialize AWS Textract and S3 clients
textract = boto3.client('textract')
s3 = boto3.client('s3')
# Define the S3 bucket name and key
bucket_name = 'your-bucket-name'
key = 'discharge-summaries.pdf'
# Define the chunk size (in characters of the extracted text) and chunk number
chunk_size = 2048
chunk_number = 1
# Extract text from the PDF file using AWS Textract
# (multi-page PDFs in S3 need the asynchronous start_document_text_detection
# API; the synchronous call is shown here for brevity)
response = textract.detect_document_text(
    Document={'S3Object': {'Bucket': bucket_name, 'Name': key}})

# Join the text of all LINE blocks, then split it into manageable chunks
full_text = " ".join(b['Text'] for b in response['Blocks']
                     if b['BlockType'] == 'LINE')
text_chunks = [full_text[i:i + chunk_size]
               for i in range(0, len(full_text), chunk_size)]
# Process each chunk
for chunk_text in text_chunks:
# Extract key information from the chunk text
medications = extract_medications(chunk_text)
lab_results = extract_lab_results(chunk_text)
# ...
# Create a prompt to process the chunk
prompt = f"Process chunk {chunk_number} of discharge summary PDF file. Key pages are pages 1-3. Include text from pages 4-6 in this chunk. Context continuation marker: {text_chunks[chunk_number-1]}"
# Process the chunk using your chosen method
# ...
# Increment the chunk number
chunk_number += 1
Note that this example uses AWS Textract to extract text from the PDF file and splits the text string into manageable chunks. You can modify the code to suit your specific requirements and processing method.
Upvotes: 0
Reputation: 1461
One can use the tiktoken library by OpenAI to count tokens (see also their Cookbook notebook). It's important to know that the max context window of a model (like 8192 tokens) is the amount of input and output tokens combined.
You can use the following function which truncates the input text based on a certain amount of max tokens:
import tiktoken

def truncate_tokens(string: str, encoding_name: str, max_length: int = 8192) -> str:
    """Truncates a text string based on max number of tokens."""
    encoding = tiktoken.encoding_for_model(encoding_name)
    encoded_string = encoding.encode(string)
    num_tokens = len(encoded_string)

    if num_tokens > max_length:
        string = encoding.decode(encoded_string[:max_length])
    return string
This then works as follows:
text = "hello world"
text = truncate_tokens(string=text, encoding_name="gpt-3.5-turbo", max_length=8192)
I also recommend this post by Microsoft which shows how to remove messages from a conversation based on the max length: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chatgpt?pivots=programming-language-chat-completions.
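For illustration, here is a minimal sketch of that trimming idea, assuming the Chat Completions message format and a rough per-message token count with tiktoken (the per-message overhead constant is an approximation, not an exact accounting):
import tiktoken

def trim_messages(messages, max_tokens=4096):
    """Drop the oldest non-system messages until the conversation fits."""
    encoding = tiktoken.get_encoding("cl100k_base")

    def count_tokens(msgs):
        # Rough count: content tokens plus a small per-message overhead
        return sum(len(encoding.encode(m["content"])) + 4 for m in msgs)

    trimmed = list(messages)
    while count_tokens(trimmed) > max_tokens and len(trimmed) > 1:
        trimmed.pop(1)  # keep the system message at index 0
    return trimmed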
Upvotes: 1
Reputation: 2039
I faced the same problem. Here is the strategy I used to send text that is much, much longer than OpenAI's GPT-3 token limit.
Depending on the model (Davinci, Curie, etc.) used, requests can use up to 4097 tokens shared between prompt and completion.
If your prompt is 4000 tokens, your completion can be 97 tokens at most. For more information on OpenAI tokens and how to count them, see here.
To ensure that we don’t exceed the maximum length limit for prompt plus completion, we need to ensure that prompt (i.e. your text) and completion (i.e. the summary) put together always fits into the 4097 token boundary.
For that reason we split the entire text into multiple text chunks, summarize each chunk independently, and finally merge all summarized chunks with a simple " ".join() call.
OpenAI has a fixed limit on the number of tokens. However, a token is not the same as a word. Hence, we first need to calculate the maximum number of words we can send to OpenAI. The documentation says:
"One token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words)."
Given the token-to-word ratio, we can send approximately 2900 words to OpenAI's GPT-3, assuming a 5-sentence summary per text chunk.
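To make the arithmetic explicit (the 0.75 words-per-token ratio comes from the rule of thumb above; the 150-token completion budget matches the summarization call further down):
max_context_tokens = 4097  # shared between prompt and completion
completion_tokens = 150    # reserved for the 5-sentence summary
prompt_tokens = max_context_tokens - completion_tokens  # 3947
word_budget = int(prompt_tokens * 0.75)  # ~2960, rounded down to ~2900
print(word_budget)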
We can choose from a plethora of strategies to split up the entire text into smaller chunks.
The simplest approach is creating a single list of all words by splitting the entire text on whitespace, and then filling buckets with the words evenly distributed. The downside is that we are likely to split a sentence half-way through and lose its meaning, because GPT then summarizes the first half of the sentence independently from the second half, ignoring any relation between the two chunks.
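For reference, a minimal sketch of that naive whitespace approach might look like this (the bucket size is arbitrary here):
def naive_chunks(text, words_per_chunk=2700):
    """Split on whitespace and fill buckets, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]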
Other options include tokenizers such as SentencePiece and spaCy's sentence splitter. Choosing the latter generates the most stable results.
The following example splits the text “My first birthday was great. My 2. was even better.” into a list of two sentences.
python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en import English
nlp = spacy.load("en_core_web_sm")
text = "My first birthday was great. My 2. was even better."
for sentence in nlp(text).sents:
    print(sentence.text)
Output
My first birthday was great.
My 2. was even better.
spaCy correctly detected the second sentence instead of splitting it after the “2.”.
Now, let's write a text_to_chunks helper function to generate chunks of sentences, where each chunk holds at most 2700 words. 2900 words was the initially calculated word limit, but we want enough buffer for words that are longer than 1.33 tokens.
def text_to_chunks(text):
    chunks = [[]]
    chunk_total_words = 0

    sentences = nlp(text)

    for sentence in sentences.sents:
        chunk_total_words += len(sentence.text.split(" "))

        if chunk_total_words > 2700:
            chunks.append([])
            chunk_total_words = len(sentence.text.split(" "))

        chunks[len(chunks)-1].append(sentence.text)

    return chunks
An alternative approach to determine the number of tokens of a text was recently introduced by OpenAI. The approach uses tiktoken and is tailored towards OpenAI's models.
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
number_of_tokens = len(encoding.encode("tiktoken is great!"))
print(number_of_tokens)
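Building on that, here is a sketch of a chunker that budgets by tokens instead of words (my adaptation, not part of the original word-based approach). Note that slicing the token list can cut a sentence mid-way, so it trades the sentence-boundary guarantee for an exact token budget:
import tiktoken

def text_to_token_chunks(text, model="gpt-3.5-turbo", max_tokens=3500):
    """Split text into chunks of at most max_tokens tokens each."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return [encoding.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]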
Next, we wrap the text summarization logic into a summarize_text function.
import openai

def summarize_text(text):
    prompt = f"Summarize the following text in 5 sentences:\n{text}"

    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        temperature=0.3,
        max_tokens=150,  # = 112 words
        top_p=1,
        frequency_penalty=0,
        presence_penalty=1
    )

    return response["choices"][0]["text"]
Our final piece of code looks like this:
chunks = text_to_chunks(one_large_text)
chunk_summaries = []
for chunk in chunks:
chunk_summary = summarize_text(" ".join(chunk))
chunk_summaries.append(chunk_summary)
summary = " ".join(chunk_summaries)
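If the joined summary is itself still longer than the token limit, the same chunk-and-summarize procedure can be applied to it again until the result fits.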
Upvotes: 30
Reputation: 93
I guess I am kind of late to this, but I developed Python and JavaScript libraries to summarize large (above the token limit) text using GPT models. Of course, they can handle text below the token limit as well.
Assuming you are on Python, just use -
>>> from gptsummarizer import summarizer
>>> generator = summarizer.Summarizer(key="put_your_openai_key_here")
>>> summary = generator.getSummary(text="Hello! How are you?")
>>> summary
Two people are exchanging greetings and inquiring about each other's wellbeing.
Upvotes: 1
Reputation: 3359
You have to make sure the combined context length stays within the 2049 tokens, so you need to reduce the size of the prompt.
OpenAI uses GPT-3 which has a context length of 2049, and text needs to fit within that context length.
I am not sure what you meant by sampling the text and compressing it. But if you meant how to summarize a longer text, then I would suggest you chunk the text so that it fits within the 2049 tokens and query OpenAI that way.
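As a rough sketch of that chunking idea (the word budget per chunk and the legacy Completions call are assumptions on my part, chosen to match the GPT-3-era API the question refers to):
import openai

def tldr_summary(text, words_per_chunk=1200):
    """Chunk the text to fit the 2049-token context, then TL;DR each chunk."""
    words = text.split()
    summaries = []
    for i in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[i:i + words_per_chunk])
        response = openai.Completion.create(
            engine="davinci",  # assumed GPT-3 engine name
            prompt=f"{chunk}\n\nTl;dr",
            max_tokens=100,
        )
        summaries.append(response["choices"][0]["text"].strip())
    return " ".join(summaries)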
Upvotes: 3