How to set a maximum sentence length in spaCy?

Question

I have a string that I converted to a spaCy Doc. However, when I iterate through Doc.sents, some of the sentences I get are too long.

Is there a way, when calling doc = nlp(string), to set a maximum length for a single sentence?

Thanks a lot, this would really help.

polm23 · Accepted Answer

No, there is no way to do this.

In natural language, sentences rarely get very long in practice, but there's no strict limit on how long a sentence can be. Imagine a sentence that lists every kind of fruit, for example.

Partly because of that, it's not clear what to do with overlong sentences. Do you split them into segments of the max length or less? Do you throw them out entirely, or cut off words after the first chunk? The right approach depends on your application.

It should typically be easy to implement the strategy you want on top of the .sents iterator.
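For instance, the "throw them out" and "cut off words" strategies mentioned above might be sketched like this. The function names drop_long_sents and truncate_sents are made up for illustration, and the sketch only assumes that each sentence supports len() and slicing, which spaCy spans do:

```python
def drop_long_sents(sents, max_len):
    """Discard every sentence longer than max_len tokens (hypothetical helper)."""
    for sent in sents:
        if len(sent) <= max_len:
            yield sent


def truncate_sents(sents, max_len):
    """Keep only the first max_len tokens of each sentence (hypothetical helper)."""
    for sent in sents:
        # Slicing a short sentence simply returns all of it.
        yield sent[:max_len]
```

With a real Doc you would pass the iterator directly, e.g. list(truncate_sents(doc.sents, 50)).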

To split sentences into chunks of at most max_len tokens, you can do this:

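def my_sents(doc, max_len):
    """Yield each sentence, split into spans of at most max_len tokens."""
    for sent in doc.sents:
        # len(sent) is the number of tokens in the sentence
        if len(sent) <= max_len:
            yield sent
            continue

        # this is a long one: yield it in consecutive max_len-token slices
        offset = 0
        while offset < len(sent):
            yield sent[offset:offset+max_len]
            offset += max_len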
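To see what this generator yields, here is a small trace. It uses a stand-in object with plain token lists instead of a real spaCy Doc (an assumption made only so the snippet runs without a pipeline); with spaCy you would pass the Doc returned by nlp(string) instead:

```python
from types import SimpleNamespace


def my_sents(doc, max_len):
    """Yield each sentence, split into chunks of at most max_len tokens."""
    for sent in doc.sents:
        if len(sent) <= max_len:
            yield sent
            continue
        offset = 0
        while offset < len(sent):
            yield sent[offset:offset + max_len]
            offset += max_len


# Stand-in for a Doc: any object with a .sents iterable of sliceable
# token sequences works for this illustration.
fake_doc = SimpleNamespace(sents=[
    ["I", "agree", "."],
    ["Apples", ",", "pears", ",", "plums", ",", "figs", "."],
])

chunks = list(my_sents(fake_doc, 3))
chunk_lengths = [len(chunk) for chunk in chunks]  # [3, 3, 3, 2]
```

The short sentence passes through unchanged, while the eight-token list is cut at fixed offsets, so the last chunk is shorter and the cuts ignore punctuation and grammar.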

However, note that for many applications this isn't useful. If you have a max length for sentences you should really think about why you have it and adjust your approach based on that.
