yoni349

Reputation: 117

What's your approach to extracting a summarized paragraph from multiple articles using GPT-3?

In the following scenario, what's your best approach using the GPT-3 API?

  1. You need to produce a short paragraph about a specific subject
  2. You must base your paragraph on a set of 3-6 articles, written in an unknown structure

Here is what I found to work well:

  1. The main constraint is the OpenAI token limit on the prompt
  2. Due to this constraint, I'd ask GPT-3 to parse the unstructured data, using the specific subject in the prompt request.
  3. I'll then iterate over each article and save all the results into a single string variable
  4. Then, repeat the request one last time, using the new string variable as input (see the code sketch after the example below)
  5. If an article is too long, I'll cut it into smaller chunks
  6. Of course, fine-tuning the model on the specific subject beforehand will produce much better results
  7. The temperature should be set to 0, to keep the output deterministic and as close as possible to the facts in the data source.

Example: Let's say I want to write a paragraph about Subject A, Subject B, and Subject C, and I have 5 articles as references. The OpenAI Playground will look something like this:

Example Article 1
----
Subject A: example A for GPT-3
Subject B: n/a
Subject C: n/a
=========
Example Article 2
----
Subject A: n/a
Subject B: example B for GPT-3
Subject C: n/a
=========
Example Article 3
----
Subject A: n/a
Subject B: n/a
Subject C: example C for GPT-3
=========
Article 1
-----
Subject A:
Subject B:
Subject C:
=========
... repeating for all the articles, saving the output to str
=========
str
-----
Subject A:
Subject B:
Subject C:
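
A minimal sketch of steps 3-4, assuming the pre-1.0 openai Python client. The model name, the few-shot block, and the articles list are placeholders to adapt to your own data:

import openai

openai.api_key = "YOUR_API_KEY"

# Few-shot examples in the playground layout above (truncated here).
FEW_SHOT = """Example Article 1
----
Subject A: example A for GPT-3
Subject B: n/a
Subject C: n/a
=========
"""

def extract_subjects(text):
    # One completion request: few-shot examples, then the text to parse.
    prompt = f"{FEW_SHOT}{text}\n----\nSubject A:"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0,  # deterministic output, stays close to the source
    )
    return response["choices"][0]["text"]

articles = ["full text of article 1", "full text of article 2"]  # your 3-6 articles

# Steps 3-4: iterate the articles, accumulate into one string, then one final pass.
accumulated = ""
for article in articles:
    accumulated += extract_subjects(article) + "\n=========\n"

paragraph = extract_subjects(accumulated)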

Upvotes: 1

Views: 501

Answers (2)

kawai_weebster

Reputation: 151

Okay, here is the approach that I've tried. First I take all the articles and do some pre-processing on them. This pre-processing removes some unwanted things from the articles, reducing the token count. Then I calculate the number of tokens in the resulting string. I would suggest keeping a maximum length of 3500 tokens, even though the limit is 4097, because the limit counts the prompt, your content, and also the response, so 3500 gives you some buffer.

If the string's token length exceeds 3500, I split it into chunks and pass each one to the OpenAI API (I would be careful about passing these chunks inside a loop, since there is a cost per call), generate a summary for each chunk, concatenate the generated summaries, and pass the result to the API to generate a final summary. When splitting into chunks, make sure the last chunk is not shorter than 100 tokens, for better accuracy. A rough sketch of this flow follows.
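
A rough sketch of the chunk-then-summarize flow, using tiktoken to count tokens. The model name, the prompt wording, and the preprocess helper are assumptions, not part of the answer:

import openai
import tiktoken

MAX_TOKENS = 3500  # buffer below the 4097-token model limit
enc = tiktoken.encoding_for_model("text-davinci-003")

def split_into_chunks(text, max_tokens=MAX_TOKENS):
    tokens = enc.encode(text)
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    # Fold a tail of fewer than 100 tokens into the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) < 100:
        chunks[-2] += chunks.pop()
    return [enc.decode(c) for c in chunks]

def summarize(text):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Summarize the following text:\n\n{text}\n\nSummary:",
        max_tokens=400,
    )
    return response["choices"][0]["text"].strip()

cleaned = preprocess(raw_text)  # hypothetical pre-processing step from above
if len(enc.encode(cleaned)) <= MAX_TOKENS:
    final_summary = summarize(cleaned)
else:
    # Each chunk is a separate paid API call, so mind how many the loop makes.
    partials = [summarize(chunk) for chunk in split_into_chunks(cleaned)]
    final_summary = summarize(" ".join(partials))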

Upvotes: 1

Franck Dernoncourt

Reputation: 83177

One may use the Python library GPT Index (MIT license) to summarize a collection of documents. From the documentation:

from gpt_index import GPTTreeIndex

index = GPTTreeIndex(documents)
response = index.query("<summarization_query>", mode="summarize")

The “default” mode for a tree-based query is traversing from the top of the graph down to leaf nodes. For summarization purposes we will want to use mode="summarize".

A summarization query could look like one of the following:

  • “What is a summary of this collection of text?”
  • “Give me a summary of person X’s experience with the company.”
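
For reference, a fuller usage sketch, assuming the articles are saved as text files in a local articles/ folder (a placeholder path) and loaded with GPT Index's SimpleDirectoryReader:

from gpt_index import GPTTreeIndex, SimpleDirectoryReader

# Load every article from the folder (placeholder path).
documents = SimpleDirectoryReader("articles").load_data()

index = GPTTreeIndex(documents)
response = index.query("What is a summary of this collection of text?", mode="summarize")
print(response)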

Upvotes: 1
