Reputation: 19664
I would expect summarization tasks to generally assume long documents. However, following the documentation here, every simple summarization invocation I make says my documents are too long:
>>> summarizer = pipeline("summarization")
>>> summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5620 > 1024). Running this sequence through the model will result in indexing errors
>>> summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (8084 > 1024). Running this sequence through the model will result in indexing errors
>>> summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5971 > 512). Running this sequence through the model will result in indexing errors
What model or configuration choice makes this most automatic? I've read other questions suggesting manually chunking the data or truncating it, but the choice of boundaries and chunk length seems like it would make a difference in the summaries. What's the best practice for an arbitrarily long document? (Unbounded would be great, but let's say 50,000 tokens at a minimum.)
Upvotes: 4
Views: 5669
Reputation: 856
LangChain has a built-in solution for this: it splits long text into chunks. You can then summarize each chunk with your summarizer, combine the summaries, and repeat the process until the result is short enough.
from langchain.text_splitter import CharacterTextSplitter

# Split on tiktoken token counts so each chunk respects the model's token limit
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1024, chunk_overlap=50
)
chunks = text_splitter.split_text(long_text)
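A minimal sketch of the summarize-and-combine step described above, assuming a Hugging Face summarization pipeline (the model choice and generation lengths are illustrative, not prescribed):

from transformers import pipeline

# Illustrative model choice; any summarization checkpoint works here
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize each chunk, then join the partial summaries into a shorter text.
partial_summaries = [
    summarizer(chunk, max_length=150, min_length=30, truncation=True)[0]["summary_text"]
    for chunk in chunks
]
combined = " ".join(partial_summaries)
# If `combined` is still too long, split and summarize it again (repeat the process).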
One refinement on top of this is to balance the chunk sizes. For example, if the text contains 1400 tokens, it is better to split it into [~700, ~700] than into [1024, ~376]. So compute a suitable chunk size at each step and pass it as the chunk_size parameter to CharacterTextSplitter, as in the sketch below.
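A minimal sketch of that balancing step, assuming tiktoken is used to count tokens with the splitter's default gpt2 encoding (the function name and limits are illustrative assumptions):

import math
import tiktoken
from langchain.text_splitter import CharacterTextSplitter

def balanced_chunks(text, max_chunk_tokens=1024, overlap=50):
    # Spread tokens evenly over the minimum number of chunks instead of filling
    # each chunk to the limit (e.g. 1400 tokens -> two ~700-token chunks).
    enc = tiktoken.get_encoding("gpt2")  # matches the splitter's default encoding
    total_tokens = len(enc.encode(text))
    n_chunks = max(1, math.ceil(total_tokens / max_chunk_tokens))
    balanced_size = math.ceil(total_tokens / n_chunks)
    splitter = CharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=balanced_size, chunk_overlap=overlap
    )
    return splitter.split_text(text)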
Upvotes: 3
Reputation: 2368
I am assuming that a minimum of 50k tokens means you are trying to summarize something as big as a novel. Unfortunately, we do not yet have a model that can process that much data at once, mostly because the memory footprint of such models would be too high for production use. Still, Pegasus (Google), Longformer, and Reformer are all viable options for summarizing long documents, and research is ongoing into models that can process longer sequences without consuming a lot of resources; Reformer, for example, is heavily optimized to handle large numbers of tokens: https://huggingface.co/blog/reformer.

By far the best practice is a "Divide and Conquer" approach: chunk your data, using the model's maximum input length as a reference, and summarize each chunk. You can even do this iteratively until you reach the specified summary length. You may also explore different methods of summarization, such as extractive and abstractive summarization, and use your creativity in combining those techniques, for example extractive summarization followed by abstractive.
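A hedged sketch of that iterative divide-and-conquer idea, chunking on the model's own token limit and re-summarizing until one pass suffices (the model choice, lengths, and helper names are assumptions, not a reference implementation):

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
tokenizer = summarizer.tokenizer
MAX_INPUT = 1024  # BART's maximum input length in tokens

def chunk_by_tokens(text, max_tokens):
    # Split on token boundaries so every chunk fits the model.
    # Note: naive token splits can cut sentences; smarter boundaries may help.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

def summarize_long(text):
    # Keep summarizing chunks and joining the results until one pass is enough.
    while len(tokenizer.encode(text)) > MAX_INPUT:
        pieces = chunk_by_tokens(text, MAX_INPUT - 2)  # leave room for special tokens
        text = " ".join(
            summarizer(p, max_length=200, min_length=50, truncation=True)[0]["summary_text"]
            for p in pieces
        )
    return summarizer(text, max_length=200, min_length=50, truncation=True)[0]["summary_text"]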
Upvotes: 11