Reputation: 461
I have a string that I converted to a spaCy Doc. However, when I iterate over the Doc.sents
object, some of the sentences I get are too long.
Is there a way when doing doc = nlp(string)
to set the maximum length for a single sentence?
Thanks a lot, this would really help.
Upvotes: 2
Views: 1521
Reputation: 15593
No, there is no way to do this.
In natural language, sentences don't usually get very long in practice, but there's no strict limit on how long a sentence can be. Imagine a sentence listing every kind of fruit, for example.
Partly because of that, it's not clear what to do with overlong sentences. Do you split them into segments of at most the max length? Do you throw them out entirely, or keep only the first chunk and discard the rest? The right approach depends on your application.
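The "throw them out" and "keep only the first chunk" strategies are both one-liners. Here's a minimal sketch of each, using plain token lists as stand-ins for spaCy Span objects so it runs without a loaded model (the function names are just illustrative):

```python
def drop_long(sents, max_len):
    """Discard any sentence longer than max_len entirely."""
    return [s for s in sents if len(s) <= max_len]

def truncate_long(sents, max_len):
    """Keep only the first max_len tokens of each sentence."""
    return [s[:max_len] for s in sents]

sents = [["a", "b"], ["c", "d", "e", "f"]]
print(drop_long(sents, 3))      # the 4-token sentence is discarded
print(truncate_long(sents, 3))  # the 4-token sentence is cut to 3 tokens
```

Both work unchanged on real Span objects, since spans support len() and slicing.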
It should typically be easy to implement the strategy you want on top of the .sents iterator.
To split sentences into chunks of at most max_len tokens, you can do this:
def my_sents(doc, max_len):
    for sent in doc.sents:
        if len(sent) <= max_len:
            yield sent
            continue
        # this is a long one; yield it in max_len-token chunks
        offset = 0
        while offset < len(sent):
            yield sent[offset:offset + max_len]
            offset += max_len
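You don't need a loaded pipeline to check the chunking behaviour. The same logic, written over any iterable of sequences (plain lists standing in for the Span objects that doc.sents yields, and split_sents standing in for my_sents above):

```python
def split_sents(sents, max_len):
    """Same chunking logic as my_sents, over any iterable of sequences."""
    for sent in sents:
        if len(sent) <= max_len:
            yield sent
            continue
        # overlong sentence: emit max_len-sized slices
        offset = 0
        while offset < len(sent):
            yield sent[offset:offset + max_len]
            offset += max_len

sents = [["spaCy", "is", "fun"], ["w1", "w2", "w3", "w4", "w5"]]
for chunk in split_sents(sents, max_len=3):
    print(chunk)
# the 3-token sentence passes through whole;
# the 5-token sentence comes out as a 3-token chunk and a 2-token chunk
```

Note that slicing a Span with sent[offset:offset + max_len] gives you another Span, so downstream code that expects spans keeps working.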
However, note that for many applications this isn't useful. If you have a maximum sentence length, think carefully about why you have it and adjust your approach accordingly.
Upvotes: 1