Reputation: 65
Please see the following code. After reading in a CSV file of 5,000 rows, I get this error message:
nlp = spacy.blank("en")
nlp.max_length = 3000000
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "cpu"
    }
)
ValueError: [E088] Text of length 2508705 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
Is there any way to solve this?
Thanks in advance!
Upvotes: 1
Views: 477
Reputation: 11484
Setting nlp.max_length should work in general (up until you run out of memory, at least):
import spacy
nlp = spacy.blank("en")
nlp.max_length = 10_000_000
doc = nlp("a " * 2_000_000)
assert len(doc.text) == 4_000_000
I doubt that the sentence-transformers model can handle texts of this length, though. In terms of the linguistic annotation, it's unlikely to be useful to have single docs that are this long.
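Since the 5,000 CSV rows presumably each hold their own text, a simpler route than raising nlp.max_length is to categorize each row as its own doc, so no single text ever comes near the limit. Below is a minimal sketch, assuming the rows live in a pandas DataFrame with a hypothetical "text" column, that the data dict of few-shot examples is a placeholder, and that text_categorizer here is classy-classification's component (which, as far as I know, exposes its scores via doc._.cats):

import classy_classification  # registers the "text_categorizer" factory
import pandas as pd
import spacy

# Placeholder few-shot examples per label; replace with your own data dict.
data = {
    "positive": ["great product", "works perfectly"],
    "negative": ["terrible quality", "stopped working after a day"],
}

nlp = spacy.blank("en")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "cpu",
    },
)

df = pd.read_csv("my_data.csv")          # hypothetical file name
texts = df["text"].astype(str).tolist()  # hypothetical column name

# nlp.pipe feeds the rows through in batches, so each doc stays far
# below nlp.max_length and no limit needs to be raised.
for doc in nlp.pipe(texts, batch_size=32):
    print(doc._.cats)

Per-row docs also fit how the sentence-transformers model is meant to be used: it embeds short passages, so row-level predictions will likely be more meaningful than a score for one 2.5-million-character doc.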
Upvotes: 1