jason

Reputation: 4439

Python: splitting a long TTS input text into chunks of strings, given character limit

Google Text-to-Speech (TTS) has a 5,000-character limit, while my text is about 50,000 characters. I need to chunk the string based on a given character limit without cutting off words.

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

How do I chunk the above string into a list of strings, each no more than 20 characters, without cutting off words?

I looked at the chunking section of the NLTK library and didn't see anything applicable there.

Upvotes: 6

Views: 8894

Answers (5)

Asclepius

Reputation: 63363

Chunking TTS inputs is more complicated than merely splitting at word, sentence, or even paragraph boundaries. Depending on the language, and especially for English, modern neural TTS systems benefit from contextual information about the surrounding words, particularly the preceding ones.

As such, instead of splitting only at words or sentences, a better approach is to split:

  1. First at paragraph boundaries.
  2. Then at sentence boundaries, for paragraphs that are too long.
  3. Then at word boundaries, for sentences that are too long.
  4. Lastly at character boundaries, for words that are too long.

After this, any consecutive splits that fit within the limit should be merged back together with appropriate separators. Overall, this keeps the chunks more intelligible for the TTS engine.
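
For illustration, here is a minimal hand-rolled sketch of that hierarchy. It is not a drop-in implementation: the boundary regexes are naive (especially the sentence pattern), and separators are normalized to single spaces when pieces are merged:

import re

# Boundary patterns from coarse to fine: paragraphs, then sentences, then words.
BOUNDARIES = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]

def split_text(text: str, limit: int, level: int = 0) -> list[str]:
    text = text.strip()
    if len(text) <= limit:
        return [text] if text else []
    if level == len(BOUNDARIES):
        # Last resort: a single "word" longer than the limit; split over characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    # Split at the current boundary, recursing into any piece that is still too long.
    pieces = []
    for part in re.split(BOUNDARIES[level], text):
        pieces.extend(split_text(part, limit, level + 1))
    # Merge consecutive pieces back together while they still fit under the limit.
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + 1 + len(piece) <= limit:
            merged[-1] += " " + piece
        else:
            merged.append(piece)
    return merged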


Python packages like semantic-text-splitter and semchunk generalize this task of "semantic splitting". Here is a solution using semantic-text-splitter:

from semantic_text_splitter import TextSplitter

def semantic_split(text: str, limit: int) -> list[str]:
    """Return a list of chunks from the given text, splitting it at semantically sensible boundaries while applying the specified character length limit for each chunk."""
    # Ref: https://stackoverflow.com/a/78288960/
    splitter = TextSplitter(limit)
    chunks = splitter.chunks(text)
    return chunks

LIMIT = 50
TEXT = """“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”"""
chunks = semantic_split(TEXT, LIMIT)

# Print chunks:
for num, chunk in enumerate(chunks, start=1):
    print({"#": num, "len": len(chunk), "chunk": chunk})

Output:

{'#': 1, 'len': 46, 'chunk': '“Well, Prince, so Genoa and Lucca are now just'}
{'#': 2, 'len': 34, 'chunk': 'family estates of the Buonapartes.'}
{'#': 3, 'len': 46, 'chunk': 'But I warn you, if you don’t tell me that this'}
{'#': 4, 'len': 50, 'chunk': 'means war, if you still try to defend the infamies'}
{'#': 5, 'len': 44, 'chunk': 'and horrors perpetrated by that Antichrist—I'}
{'#': 6, 'len': 43, 'chunk': 'really believe he is Antichrist—I will have'}
{'#': 7, 'len': 49, 'chunk': 'nothing more to do with you and you are no longer'}
{'#': 8, 'len': 48, 'chunk': 'my friend, no longer my ‘faithful slave,’ as you'}
{'#': 9, 'len': 33, 'chunk': 'call yourself! But how do you do?'}
{'#': 10, 'len': 48, 'chunk': 'I see I have frightened you—sit down and tell me'}
{'#': 11, 'len': 14, 'chunk': 'all the news.”'}

Upvotes: 1

Parand

Reputation: 106310

Building on Mark's answer, it looks like there's a small bug in that code when no space is found in the search window (e.g. when a single word is longer than maxlength): rfind returns -1 and bogus chunks are yielded. Something like this should work:

    def text_to_chunks(s, maxlength):
        """Lazily yield space-delimited chunks of up to maxlength characters.
        The final chunk may be longer if it contains a word exceeding the limit."""
        start = 0
        end = 0
        while start + maxlength < len(s) and end != -1:
            # Find the last space within the next maxlength characters.
            end = s.rfind(" ", start, start + maxlength + 1)
            if end == -1:  # no space in the window (an oversized word); stop early
                break
            yield s[start:end]
            start = end + 1
        yield s[start:]
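
For example, with a made-up oversized word (illustrative, not from the original answer), the generator now falls through to the final yield instead of emitting a truncated slice:

    print(list(text_to_chunks("a Honorificabilitudinitatibus", 10)))
    # -> ['a', 'Honorificabilitudinitatibus']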

Upvotes: 0

Mark

Reputation: 92440

This is a similar idea to Green Cloak Guy's, but it uses a generator rather than building a list. This should be a little more memory-friendly with large texts and lets you iterate over the chunks lazily. You can turn it into a list with list() or use it anywhere an iterator is expected:

s = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."

def get_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength < len(s) and end != -1:
        end = s.rfind(" ", start, start + maxlength + 1)
        yield s[start:end]
        start = end + 1
    yield s[start:]

chunks = get_chunks(s, 25)

# Make a list of (chunk, length) pairs:
[(n, len(n)) for n in chunks]

Results:

[('Well, Prince, so Genoa', 22),
 ('and Lucca are now just', 22),
 ('family estates of the', 21),
 ('Buonapartes. But I warn', 23),
 ('you, if you don’t tell me', 25),
 ('that this means war, if', 23),
 ('you still try to defend', 23),
 ('the infamies and horrors', 24),
 ('perpetrated by that', 19),
 ('Antichrist—I really', 19),
 ('believe he is', 13),
 ('Antichrist—I will have', 22),
 ('nothing more to do with', 23),
 ('you and you are no longer', 25),
 ('my friend, no longer my', 23),
 ('‘faithful slave,’ as you', 24),
 ('call yourself! But how do', 25),
 ('you do? I see I have', 20),
 ('frightened you—sit down', 23),
 ('and tell me all the news.', 25)]

Upvotes: 7

Green Cloak Guy

Reputation: 24691

A base-Python approach would be to look 20 characters ahead, find the last bit of whitespace possible, and cut the line there. This isn't an incredibly elegant implementation of that, but it should do the job:

orig_string = "“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”"
list_of_lines = []
max_length = 20
while len(orig_string) > max_length:
    # Cut at the last space within the next max_length characters.
    line_length = orig_string[:max_length].rfind(' ')
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)
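
To inspect the result, print each chunk with its length (a small illustrative addition):

for line in list_of_lines:
    print(len(line), line)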

Upvotes: 6

sheth7

Reputation: 349

You can use the nltk.tokenize methods as follows:

import nltk
nltk.download("punkt")  # one-time download of the tokenizer models

corpus = '''
“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”
'''

tokens = nltk.tokenize.word_tokenize(corpus)

or, to split into sentences instead:

sent_tokens = nltk.tokenize.sent_tokenize(corpus)
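
Note that tokenizing alone doesn't produce the length-limited chunks the question asks for; the sentence tokens still need to be merged greedily under the limit. A minimal sketch, using a hypothetical limit value and assuming no single sentence exceeds it:

limit = 500  # hypothetical character limit for illustration
chunks = []
current = ""
for sent in sent_tokens:
    candidate = (current + " " + sent).strip()
    if len(candidate) <= limit:
        current = candidate
    else:
        if current:
            chunks.append(current)
        current = sent  # assumes no single sentence exceeds the limit
if current:
    chunks.append(current)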

Upvotes: -2
