Holger

Reputation: 111

Speed up spaCy processing

I want to pre-process text data using spaCy (or something else). My code below works, but it is really slow. I only have a 20 MB zipped text file as a demo, and it already takes more than 10 minutes to process. The problem is that I will eventually need to process about 20 GB of zipped text files, so I want to speed up my code before then.

Also, how will I even be able to deal with a 20 GB zipped text file? It would blow past my 16 GB of main memory if I ran the code below on it. Can I read it line by line and still get good speed?

Any help would be appreciated.

import string
import zipfile

import spacy

nlp = spacy.load("en_core_web_sm")

with zipfile.ZipFile(filename, 'r') as thezip:
    text = thezip.open(thezip.filelist[0], mode='r').read()

text = text.decode('utf-8').splitlines()


for doc in nlp.pipe(text, disable=["tok2vec", "parser", "attribute_ruler"], batch_size=2000, n_process=4):
    # Do something with the doc here.
    # First remove punctuation,
    tokens = [t for t in doc if t.text not in string.punctuation]
    # then remove stop words, non-ASCII tokens, digits, single characters
    # and anything containing a digit.
    tokens = [t for t in tokens
              if not t.is_stop and t.is_ascii and not t.is_digit
              and len(t) > 1 and not any(char.isdigit() for char in t.text)]
    # Skip empty lines, lower-case the tokens and join them back into a sentence.
    if len(tokens):
        sentence = " ".join(token.text.lower() for token in tokens)
        # do something useful with sentence here

Upvotes: 1

Views: 994

Answers (1)

polm23

Reputation: 15593

It looks like you just want to use the spaCy tokenizer? In that case use nlp = spacy.blank("en") instead of spacy.load, and then you can leave out the disable part in nlp.pipe.
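As a minimal sketch, that tokenizer-only setup could look something like this (lines here just stands for whatever iterable of strings you feed in; the attributes used in your filtering, like is_stop and is_punct, are lexical, so they still work without any trained components):

import spacy

# A blank English pipeline: just the tokenizer, no trained components to slow things down.
nlp = spacy.blank("en")

for doc in nlp.pipe(lines, batch_size=2000):
    # Lexical attributes such as is_stop and is_punct are still available here.
    ...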

Also to be clear, you're using spaCy v2?

Here's a function that makes your code faster and also cleaner:

def is_ok(tok):
    # this is much faster than `not in string.punctuation`
    if tok.is_punct: return False
    if tok.is_stop: return False
    if not tok.is_ascii: return False
    if tok.is_digit: return False
    if len(tok.text) < 2: return False
    # this gets rid of anything with a number in it
    if 'd' in tok.shape_: return False
    return True

# replace your stuff with this:
toks = [tok for tok in doc if is_ok(tok)]

Reading your zip file one line at a time should be totally fine since you're just using the tokenizer.
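For example, here's a rough sketch of what that could look like (assuming a single text member inside the archive, as in your code, and reusing the is_ok helper from above; iter_lines is just a name I made up): wrapping the zip member in io.TextIOWrapper gives you a line iterator, and nlp.pipe happily consumes a generator, so the whole file never has to sit in memory.

import io
import zipfile

import spacy

nlp = spacy.blank("en")

def iter_lines(filename):
    # Stream the first member of the zip archive one decoded line at a time.
    with zipfile.ZipFile(filename, "r") as thezip:
        with thezip.open(thezip.filelist[0], mode="r") as fh:
            for line in io.TextIOWrapper(fh, encoding="utf-8"):
                yield line.rstrip("\n")

for doc in nlp.pipe(iter_lines(filename), batch_size=2000):
    toks = [tok for tok in doc if is_ok(tok)]
    if toks:
        sentence = " ".join(tok.text.lower() for tok in toks)
        # do something useful with sentence here

If tokenization itself ever becomes the bottleneck, nlp.pipe also accepts an n_process argument for multiprocessing.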

Upvotes: 1
