Peter

Reputation: 1128

Spacy Memory Usage Performance Improvements

I have tens of thousands of documents; each doc is roughly 150k characters, ~25k whitespace-delimited tokens, and ~2k unique tokens. I'm using spaCy to pre-process them (stopword removal and lemmatization). The preprocessing depends on token.pos_ and token.lemma_, as shown below.

I learned that I had set up spaCy incorrectly by disabling the tok2vec pipeline component (which is needed for POS tagging); after fixing that, my memory usage is extremely high. The app hangs and then the OOM killer kills my Python process.

My approach is to feed the docs into nlp.pipe in chunks of 100 with n_process=4. This worked fine until I fixed the above bug. Now the only way the app runs without hanging or triggering the OOM killer is to reduce the number of docs I feed into the pipe to ~25-50. Reducing n_process to 1 doesn't seem to have an impact. Here's my rough approach:

import spacy
from bs4 import BeautifulSoup
import unidecode
import re

nlp = spacy.load('en_core_web_lg')
nlp.max_length = 5000000
nlp.disable_pipe("parser")
nlp.disable_pipe("ner")
nlp.enable_pipe("senter")

def pre_pre_process(record, synswap=True):
    (doc_id, text) = record

    # partial pre-preprocessing = just strip HTML
    text1 = BeautifulSoup(text, "html.parser").get_text(separator=" ")

    # full pre-preprocessing = do all the pre-preprocessing
    text2 = " ".join(text1.strip().split())
    text2 = unidecode.unidecode(text2)
    text2 = text2.lower()
    
    return (text2, {'doc_id': doc_id, 'strip_html': text1, 'ppp': 'full-ppp'})


def pre_process_text(doc, convert_num=True, lemmatization=True,
                     punctuations=True, remove_num=True, special_chars=True,
                     stop_words=True, short_char=True, remove_edgar_junk=True):
    fully_processed = []
    edgar_jnk_patt = re.compile(r'(?is)ex-\d+\.?\d*')  # matches EDGAR exhibit labels like "ex-10.1"
    edgar_jnk = []

    for token in doc:
        # (token, token.pos_, token.is_stop, token.is_punct, token.lemma_)
        flag = True  # assume every token should be added to the vocab
        edit = token.text
        # remove stop words
        if stop_words is True and token.is_stop and token.pos_ != 'NUM':
            flag = False
        # remove punctuations
        if punctuations is True and (token.pos_ == 'PUNCT' or token.is_punct) and flag is True:
            flag = False
        # remove special characters
        if special_chars is True and token.pos_ == 'SYM' and flag is True:
            flag = False
        # remove numbers
        if remove_num is True and (token.pos_ == 'NUM' or token.text.isnumeric()) and flag is True:
            flag = False
        # remove short tokens
        if short_char is True and len(token) < 3 and flag is True:
            flag = False
        # convert tokens to base form
        elif lemmatization is True and token.lemma_ != "-PRON-" and flag is True:
            edit = token.lemma_
        # remove edgar junk
        if remove_edgar_junk is True:
            if token.i < 10:
                if token.text.endswith(('.htm', '.html')):
                    flag = False
                    edgar_jnk.append(token.lemma_)
                elif edgar_jnk_patt.search(token.lemma_):
                    flag = False
                    edgar_jnk.append(token.lemma_)
            if token.lemma_ in edgar_jnk and flag is True:
                flag = False

        # append tokens edited and not removed to list
        if edit != "" and flag is True:
            fully_processed.append(edit)
    return fully_processed

# In the complete script, `data` is queried from a DB, limited by a param `query_limit = 50`. The script loops (while True), grabbing `query_limit` records at a time until there are no more records to query; a rough sketch of that loop follows the pipeline code below.

# For reproducibility, `data` sample here: https://gist.github.com/roablep/09731a9a0996fc82aecedb6fcb7c026a

completed_jobs = []
pipeline_texts = [pre_pre_process(d) for d in data]
for doc, context in nlp.pipe(pipeline_texts, as_tuples=True, n_process=4):
    tokens = pre_process_text(doc)
    completed_jobs.append((context, tokens))
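
For completeness, here is a rough sketch of the outer batching loop that wraps the pipeline code above; `fetch_records` and its `offset`/`limit` parameters are placeholders for the actual DB query, not real code:

query_limit = 50
offset = 0
completed_jobs = []
while True:
    # fetch_records is a placeholder name for the real DB query
    data = fetch_records(offset=offset, limit=query_limit)
    if not data:
        break
    pipeline_texts = [pre_pre_process(d) for d in data]
    for doc, context in nlp.pipe(pipeline_texts, as_tuples=True, n_process=4):
        completed_jobs.append((context, pre_process_text(doc)))
    offset += query_limit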

My questions are:

  1. Why is tok2vec eating so much memory?
  2. How can I profile what's happening in nlp.pipe?
  3. Is there a better way to implement this pipeline overall?
  4. Is there a better way to implement the pre-processing? (Is there a built-in spaCy approach, or is what I have pretty standard?)

Related to question 2: there is interesting spikiness in memory usage. [screenshot: memory-usage profile]

Upvotes: 1

Views: 711

Answers (1)

polm23

Reputation: 15593

spaCy is not really designed to work with 25k-word documents (that's about the length of a short novel) as single strings. You should split your documents into some natural sub-unit, like paragraphs, and process those instead. Note that even if you don't use spaCy, working with documents of that length without splitting them up somehow will be challenging.
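
As a minimal sketch of that approach (assuming paragraphs in your raw text are separated by blank lines, and reusing the nlp object and pre_process_text function from the question; split_paragraphs and process_document are illustrative helpers, not spaCy APIs):

def split_paragraphs(text):
    # naive splitter: treat blank lines as paragraph boundaries
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def process_document(doc_id, text):
    # stream many small paragraph-sized chunks through the pipeline
    # instead of one ~150k-character string
    tokens = []
    for paragraph_doc in nlp.pipe(split_paragraphs(text), batch_size=50):
        tokens.extend(pre_process_text(paragraph_doc))
    return (doc_id, tokens)

Because each Doc then covers only a paragraph, the tagger and lemmatizer never have to hold a 25k-token document in memory at once, which keeps peak memory bounded.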

Upvotes: 1
