Reputation: 11
I am trying to do custom NER using spaCy for articles, but when I start to train the model I get the error "[E088] Text of length 1021312 exceeds maximum of 1000000....". I tried the following solutions:
i. Increasing nlp.max_length = 1500000
ii. Using spaCy "en_core_web_lg" and then disabling the irrelevant pipeline components
iii. Trying nlp.max_length = len(txt) + 5000
iv. Changing max_length in the config file
v. nlp.max_length = len(text)
vi. Splitting the data on '\n' and rejoining with spaces (the newline character "\n" is what creates a new line): doc = nlp.make_doc(" ".join(text.split('\n')))
But all in vain.
Upvotes: 1
Views: 2430
Reputation: 1
The way I got round this was to split my text, process it, then rejoin it as before. In my case the strings are in a list of 1 to n elements.
def split_elements(strings, limit):
    """Split any string longer than `limit` into chunks of at most `limit`
    characters, remembering which original element each chunk came from."""
    result = []
    original_positions = []
    for i, string in enumerate(strings):
        if len(string) > limit:
            start = 0
            while start < len(string):
                end = min(start + limit, len(string))
                result.append(string[start:end])
                original_positions.append(i)
                start = end
        else:
            result.append(string)
            original_positions.append(i)
    return result, original_positions
def rejoin_elements(split_strings, original_positions):
    """Reassemble the chunks produced by split_elements back into the
    original list of strings."""
    result = []
    current_index = -1
    current_string = ""
    for i, string in enumerate(split_strings):
        if original_positions[i] != current_index:
            if current_string:
                result.append(current_string)
            current_index = original_positions[i]
            current_string = string
        else:
            current_string += string
    if current_string:
        result.append(current_string)
    return result
# Example usage
import spacy

text = ["This is a very long string ... more than .", "Short one."]
limit = 999_999

# split the text into manageable chunks
split_text, original_positions = split_elements(text, limit)

nlp = spacy.load("en_core_web_sm")
for item in split_text:
    doc = nlp(item)
    # do stuff

# if wanted, put the text back into its original shape
original_text = rejoin_elements(split_text, original_positions)
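One caveat with splitting at a fixed character count is that a chunk boundary can land in the middle of a word or entity. A variant of the same idea that backs up to the nearest space is sketched below; `chunk_text` is a hypothetical helper, not part of the answer above:

```python
def chunk_text(text, limit):
    """Yield pieces of `text` no longer than `limit` characters,
    breaking on the last space inside each window where possible
    so words are not cut in half."""
    start = 0
    n = len(text)
    while start < n:
        end = min(start + limit, n)
        if end < n:
            # back up to the last space inside the window, if any
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the left chunk
        yield text[start:end]
        start = end
```

Joining the yielded pieces reproduces the original string exactly, so each piece can be fed to `nlp(...)` independently and the results mapped back.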
Upvotes: 0
Reputation: 81
I use the code below. It worked for me.
import spacy

# load the large model but disable the heavy components, then raise the cap
nlp = spacy.load('en_core_web_lg', disable=['parser', 'tagger', 'ner'])
nlp.max_length = 1198623
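The components you disable matter as much as the limit itself: the E088 cap exists because the parser and NER can use a lot of memory on long texts, while tokenization alone is cheap. If tokenization is all you need, a blank pipeline avoids loading any trained components at all; a minimal sketch, assuming the text only needs to be tokenized:

```python
import spacy

# a blank English pipeline tokenizes without any trained components
nlp = spacy.blank("en")
nlp.max_length = 1_200_000  # raise the cap past the document length

long_text = "word " * 220_000  # 1,100,000 characters, over the default cap
doc = nlp(long_text)
print(doc[0].text)  # -> "word"
```

With the heavy components out of the pipeline, raising `max_length` is safe; the attribute is only a guard against accidental memory blow-ups, not a hard tokenizer limit.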
Upvotes: 1