Britt

Reputation: 581

Speed up SpaCy tokenizer

I am tokenizing tens of thousands of documents using SpaCy. On average it is taking about 5 seconds per document. Any suggestions on how to speed up the tokenizer?

Some additional information:

The following is my code:

import spacy
import en_core_web_sm
from pathlib import Path
from time import time

st = time()
nlp = en_core_web_sm.load(disable = ['ner', 'tagger', 'parser', 'textcat'])
p = Path('input_text/').glob('*.txt')
files = ['input_text/' + x.name for x in p if x.is_file()]

#nlp = spacy.load('en_core_web_sm')

stopwords_file = 'stopwords.txt'

def getStopWords():
    f = open(stopwords_file, 'r')
    stopWordsSet = f.read()
    return stopWordsSet

stopWordsSet = getStopWords()
out_file = 'token_results.txt'
for file in files:
    #print (out_file)
    with open(file, encoding="utf8") as f:
        st_doc = time()
        for line in f:

            doc = nlp(line)

            for token in doc:
                if (not token.text.lower() in stopWordsSet
                    and not token.is_punct and not token.is_space and not token.like_num
                    and len(token.shape_)>1):                    

                    tup = (token.text, '|', token.lemma_)

                    appendFile = open(out_file, 'a', encoding="utf-8")
                    appendFile.write(" " + tup[0])
        print((time() - st_doc), 'seconds elapsed for', file)
        appendFile.write('\n')
        appendFile.close()
print((time() - st)/60, 'minutes elapsed')

Upvotes: 3

Views: 4701

Answers (1)

aab

Reputation: 11484

  1. The main problem: open your output file once and keep it open until the end of your script. Repeatedly reopening the file, seeking to the end of an ever-larger text file, and closing it again is going to be extremely slow (see the combined sketch at the end of this answer).

  2. Read the stopwords into an actual set(). Otherwise you're searching for each token in one long string containing the whole stopwords file, which accidentally matches partial words and is much, much slower than checking set membership.

  3. Use nlp.pipe(), or for plain tokenization just nlp.tokenizer.pipe(), to speed up the spaCy part a bit. With a bunch of short one-sentence documents this doesn't seem to make a huge difference. It is much faster to tokenize one large document than to treat each line as an individual document, but whether you want to do that depends on how your data is structured. If you're only tokenizing, you can raise the maximum document length (nlp.max_length) if you need to.

texts = f.readlines()
docs = nlp.tokenizer.pipe(texts)  # tokenize only, skipping the rest of the pipeline

for doc in docs:
    for token in doc:
        ...
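
Putting the three points together, a rough sketch might look like the following. It assumes the same paths as in the question and one stopword per line in stopwords.txt, so adjust as needed:

import spacy
from pathlib import Path

nlp = spacy.load('en_core_web_sm', disable=['ner', 'tagger', 'parser', 'textcat'])
# nlp.max_length = 2_000_000  # only needed if you feed in very large documents

# 2. read the stopwords into an actual set (assumes one stopword per line)
with open('stopwords.txt', encoding='utf-8') as f:
    stop_words = set(f.read().split())

files = sorted(Path('input_text/').glob('*.txt'))

# 1. open the output file once and keep it open for the whole run
with open('token_results.txt', 'w', encoding='utf-8') as out:
    for path in files:
        with open(path, encoding='utf-8') as f:
            # 3. tokenize all lines in one pass instead of one nlp() call per line
            for doc in nlp.tokenizer.pipe(f.readlines()):
                for token in doc:
                    if (token.text.lower() not in stop_words
                            and not token.is_punct and not token.is_space
                            and not token.like_num and len(token.shape_) > 1):
                        out.write(' ' + token.text)
        out.write('\n')

The key changes are the single open() call for the output file, the set() of stopwords, and one tokenizer.pipe() call per file instead of one nlp() call per line.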

Upvotes: 5
