Britt

Reputation: 581

Speed up SpaCy tokenizer

I am tokenizing tens of thousands of documents using SpaCy. On average it is taking about 5 seconds per document. Any suggestions on how to speed up the tokenizer?

Some additional information:

The following is my code:

import spacy
import en_core_web_sm
from pathlib import Path
from time import time

st = time()
nlp = en_core_web_sm.load(disable = ['ner', 'tagger', 'parser', 'textcat'])
p = Path('input_text/').glob('*.txt')
files = ['input_text/' + x.name for x in p if x.is_file()]

#nlp = spacy.load('en_core_web_sm')

stopwords_file = 'stopwords.txt'

def getStopWords():
    f = open(stopwords_file, 'r')
    stopWordsSet = f.read()
    return stopWordsSet

stopWordsSet = getStopWords()
out_file = 'token_results.txt'
for file in files:
    #print (out_file)
    with open(file, encoding="utf8") as f:
        st_doc = time()
        for line in f:

            doc = nlp(line)

            for token in doc:
                if (not token.text.lower() in stopWordsSet
                    and not token.is_punct and not token.is_space and not token.like_num
                    and len(token.shape_)>1):                    

                    tup = (token.text, '|', token.lemma_)

                    appendFile = open(out_file, 'a', encoding="utf-8")
                    appendFile.write(" " + tup[0])
        print((time() - st_doc), 'seconds elapsed for', file)
        appendFile.write('\n')
        appendFile.close()
print((time() - st)/60, 'minutes elapsed')

Upvotes: 3

Views: 4701

Answers (1)

aab

Reputation: 11484

  1. The main problem: open your output file once and keep it open until the end of your script. Repeatedly reopening the file, seeking to the end of an ever-larger text file, and closing it again is going to be extremely slow (see the combined sketch at the end of this answer).

  2. Read the stopwords into an actual set(). Otherwise you're searching for each token in one long string containing the whole stopwords file, which accidentally matches partial words and is much, much slower than checking set membership.

  3. Use nlp.pipe(), or for plain tokenization just nlp.tokenizer.pipe(), to speed up the spaCy part a bit. With a bunch of short one-sentence documents this doesn't seem to make a huge difference. It is much faster to tokenize one large document than to treat each line as an individual document, but whether you want to do that depends on how your data is structured. If you're only tokenizing, you can raise the maximum document length (nlp.max_length) if you need to.

texts = f.readlines()
docs = nlp.tokenizer.pipe(texts)  # tokenize only, skipping the rest of the pipeline

for doc in docs:
    for token in doc:
        ...
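
Putting the three points together, a rough sketch might look like the following. It assumes the same paths as in the question and one stopword per line in stopwords.txt, so adjust as needed:

import spacy
from pathlib import Path

nlp = spacy.load('en_core_web_sm', disable=['ner', 'tagger', 'parser', 'textcat'])
# nlp.max_length = 2_000_000  # only needed if you feed in very large documents

# 2. read the stopwords into an actual set (assumes one stopword per line)
with open('stopwords.txt', encoding='utf-8') as f:
    stop_words = set(f.read().split())

files = sorted(Path('input_text/').glob('*.txt'))

# 1. open the output file once and keep it open for the whole run
with open('token_results.txt', 'w', encoding='utf-8') as out:
    for path in files:
        with open(path, encoding='utf-8') as f:
            # 3. tokenize all lines in one pass instead of one nlp() call per line
            for doc in nlp.tokenizer.pipe(f.readlines()):
                for token in doc:
                    if (token.text.lower() not in stop_words
                            and not token.is_punct and not token.is_space
                            and not token.like_num and len(token.shape_) > 1):
                        out.write(' ' + token.text)
        out.write('\n')

The key changes are the single open() call for the output file, the set() of stopwords, and one tokenizer.pipe() call per file instead of one nlp() call per line.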

Upvotes: 5
