Timskouten

Reputation: 65

Trying to increase

Please see the following code. After reading in a CSV file of 5,000 rows, I get the error message shown below:

import spacy
import classy_classification  # presumably what registers the "text_categorizer" component

nlp = spacy.blank("en")
nlp.max_length = 3000000
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,  # labelled examples read from the CSV
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "cpu"
    }
)

ValueError: [E088] Text of length 2508705 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
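Since the limit is in characters, a quick way to see what is hitting it is to check the lengths of the inputs before calling nlp. A minimal sketch, assuming the CSV texts end up in a single column (the file name and column name below are placeholders):

import pandas as pd

# Placeholder file/column names: adjust to the actual CSV.
texts = pd.read_csv("data.csv")["text"].astype(str).tolist()

# The limit is in characters, so len(text) is the relevant check.
print("longest single text:", max(len(t) for t in texts))
print("total characters if concatenated:", sum(len(t) for t in texts))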

Is there any way to solve this?

Thanks in advance!

Upvotes: 1

Views: 477

Answers (1)

aab

Reputation: 11484

Setting nlp.max_length should work in general (up until you run out of memory, at least):

import spacy
nlp = spacy.blank("en")
nlp.max_length = 10_000_000
doc = nlp("a " * 2_000_000)
assert len(doc.text) == 4_000_000

I doubt that the sentence-transformers model can handle texts of this length, though. In terms of linguistic annotation, it's unlikely to be useful to have single docs that are this long.
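If the 5,000 CSV rows are separate texts, another option is to process them one by one (e.g. with nlp.pipe) instead of as a single concatenated string, so that no individual doc ever approaches max_length. A minimal sketch, with texts standing in for the rows read from the CSV:

import spacy

nlp = spacy.blank("en")

# `texts` is a placeholder for the individual rows read from the CSV;
# each row becomes its own Doc, so no single text hits the limit.
texts = ["first row of the csv", "second row of the csv"]

for doc in nlp.pipe(texts, batch_size=256):
    print(len(doc.text))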

Upvotes: 1
