Tulsi Patro

Reputation: 11

How to get rid of the 'nlp.max_length' limit?

I am trying to do custom NER with spaCy on articles, but when I start training the model I get the error "[E088] Text of length 1021312 exceeds maximum of 1000000...". I tried the following:

i. Increasing the limit: nlp.max_length = 1500000
ii. Using spaCy's "en_core_web_lg" model and disabling the irrelevant pipeline components
iii. Setting nlp.max_length = len(txt) + 5000
iv. Changing max_length in the config file
v. Setting nlp.max_length = len(text)
vi. Splitting the data on '\n' and rejoining with spaces: doc = nlp.make_doc(" ".join(text.split('\n')))

But all in vain.
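For reference, the usual alternative to raising nlp.max_length (which also raises memory use) is to process the text in chunks that stay under the limit. A minimal pure-Python sketch of that idea, where the chunk size and separator are illustrative assumptions, not values from the question:

```python
def chunk_text(text, max_len=900_000, sep="\n"):
    """Split text into pieces no longer than max_len characters,
    preferring to break at `sep` (e.g. a paragraph break)."""
    chunks = []
    while len(text) > max_len:
        # look for the last separator inside the allowed window
        cut = text.rfind(sep, 0, max_len)
        if cut <= 0:
            cut = max_len  # no separator found: hard split
        chunks.append(text[:cut])
        text = text[cut:].lstrip(sep)
    if text:
        chunks.append(text)
    return chunks

# Each chunk can then be passed to nlp(...) on its own, so no single
# call ever exceeds the limit:
# for chunk in chunk_text(long_text):
#     doc = nlp(chunk)
```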

Upvotes: 1

Views: 2430

Answers (2)

Darryl Oatridge

Reputation: 1

The way I got around this was to split my text, process it, then rejoin it as before. In my case the strings are in a list of one or more elements.

def split_elements(strings, limit):
    """Split any string longer than `limit` into chunks of at most
    `limit` characters, remembering which original element each
    chunk came from."""
    result = []
    original_positions = []
    
    for i, string in enumerate(strings):
        if len(string) > limit:
            start = 0
            while start < len(string):
                end = min(start + limit, len(string))
                result.append(string[start:end])
                original_positions.append(i)
                start = end
        else:
            result.append(string)
            original_positions.append(i)
    
    return result, original_positions

def rejoin_elements(split_strings, original_positions):
    """Reverse split_elements: concatenate chunks that share an
    original position back into one string each."""
    result = []
    current_index = -1
    current_string = ""
    
    for i, string in enumerate(split_strings):
        if original_positions[i] != current_index:
            if current_string:
                result.append(current_string)
            current_index = original_positions[i]
            current_string = string
        else:
            current_string += string
    
    if current_string:
        result.append(current_string)
    
    return result

# Example usage
text = ["This is a very long string ... more than .", "Short one."]
limit = 999_999

# split the text into manageable chunks
split_text, original_positions = split_elements(text, limit)

import spacy

nlp = spacy.load("en_core_web_sm")
for item in split_text:
    doc = nlp(item)
    # do stuff

# if wanted, put the text back to the original
original_text = rejoin_elements(split_text, original_positions)
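One caveat with a fixed-width split like this: a chunk boundary can land in the middle of a word, which may change tokenisation at the seam. A hedged variant that backs the cut point off to the last space in each window (the function name is my own, not part of the answer above):

```python
def split_on_whitespace(string, limit):
    """Split `string` into pieces of at most `limit` characters,
    cutting at the last space inside each window when possible."""
    pieces = []
    start = 0
    while start < len(string):
        end = min(start + limit, len(string))
        if end < len(string):
            # back off to the last space so words stay intact
            space = string.rfind(" ", start, end)
            if space > start:
                end = space + 1  # keep the space with the left piece
        pieces.append(string[start:end])
        start = end
    return pieces
```

Because no characters are dropped, `"".join(pieces)` reconstructs the original string exactly, so rejoining works the same way as in the answer above.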

Upvotes: 0

ChengguiS.

Reputation: 81

I use the code below. It worked for me.

    import spacy

    nlp = spacy.load('en_core_web_lg', disable=['parser', 'tagger', 'ner'])
    nlp.max_length = 1198623

Upvotes: 1
