Reputation: 111
I want to pre-process text data using spaCy (or something else). My code below works but is really slow. I only have a 20 MB zipped text file as a demo, and it takes more than 10 minutes to process with my code. The real task is about 20 GB of zipped text files, so I want to speed up my algorithm before scaling up.
Also, how can I deal with a 20 GB zipped text file? It will blow past my 16 GB of main memory if I run the code below. Can I read it line by line and still get good speed?
Any help would be appreciated.
import string
import zipfile

import spacy

nlp = spacy.load("en_core_web_sm")
with zipfile.ZipFile(filename, 'r') as thezip:
    text = thezip.open(thezip.filelist[0], mode='r').read()
text = text.decode('utf-8').splitlines()

# note: n_process belongs to nlp.pipe, not spacy.load
for doc in nlp.pipe(text, disable=["tok2vec", "parser", "attribute_ruler"],
                    batch_size=2000, n_process=4):
    # Do something with the doc here
    # First remove punctuation
    tokens = [t for t in doc if t.text not in string.punctuation]
    # then remove stop words, weird unicode characters, words with digits in them
    # and empty characters.
    tokens = [t for t in tokens if not t.is_stop and t.is_ascii and not t.is_digit
              and len(t) > 1 and not any(char.isdigit() for char in t.text)]
    # remove empty lines, make it lower case and put them in sentence form
    if len(tokens):
        sentence = " ".join(token.text.lower() for token in tokens)
        # do something useful with sentence here
Upvotes: 1
Views: 994
Reputation: 15593
It looks like you just want to use the spaCy tokenizer? In that case use `nlp = spacy.blank("en")` instead of `spacy.load`, and then you can leave out the `disable` part in `nlp.pipe`.
Also to be clear, you're using spaCy v2?
Here's a function that makes your code faster and also cleaner:
def is_ok(tok):
    # this is much faster than `not in string.punctuation`
    if tok.is_punct: return False
    if tok.is_stop: return False
    if not tok.is_ascii: return False
    if tok.is_digit: return False
    if len(tok.text) < 2: return False
    # this gets rid of anything with a number in it
    if 'd' in tok.shape_: return False
    return True

# replace your stuff with this:
toks = [tok for tok in doc if is_ok(tok)]
Reading your zip file one line at a time should be totally fine since you're just using the tokenizer.
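For the line-at-a-time reading, here's a minimal stdlib-only sketch of what I mean (the generator name and the in-memory demo archive are just for illustration): wrap the zip member's byte stream in `io.TextIOWrapper` so lines are decoded incrementally instead of loading the whole file into memory.

```python
import io
import zipfile

def stream_lines(zip_source):
    """Yield decoded lines from the first member of a zip archive,
    without reading the whole file into memory."""
    with zipfile.ZipFile(zip_source) as thezip:
        with thezip.open(thezip.filelist[0], mode="r") as raw:
            # TextIOWrapper decodes the byte stream lazily, line by line
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                yield line.rstrip("\n")

# Demo: build a tiny zip in memory and stream it back out
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("demo.txt", "first line\nsecond line\n")
lines = list(stream_lines(buf))
print(lines)  # ['first line', 'second line']
```

Since `nlp.pipe` accepts any iterable, you can hand it `stream_lines(filename)` directly; spaCy batches internally, so peak memory is bounded by `batch_size` rather than by the file size.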
Upvotes: 1