Reputation: 111
I want to pre-process text data using spaCy (or something else). My code below works but is really slow. I only have a 20 MB zipped text file as a demo, and it takes more than 10 minutes to process with my code. The real task is about 20 GB of zipped text files, so I want to speed up my algorithm before scaling up.
Also, how can I deal with a 20 GB zipped text file? It will blow past my 16 GB of main memory if I run the code below. Can I read it line by line and still get good speed?
Any help would be appreciated.
import string
import zipfile

import spacy

nlp = spacy.load("en_core_web_sm")
with zipfile.ZipFile(filename, 'r') as thezip:
    text = thezip.open(thezip.filelist[0], mode='r').read()
text = text.decode('utf-8').splitlines()

# note: n_process belongs to nlp.pipe, not spacy.load
for doc in nlp.pipe(text, disable=["tok2vec", "parser", "attribute_ruler"],
                    batch_size=2000, n_process=4):
    # Do something with the doc here
    # First remove punctuation
    tokens = [t for t in doc if t.text not in string.punctuation]
    # then remove stop words, weird unicode characters, words with digits in them
    # and empty characters.
    tokens = [t for t in tokens if not t.is_stop and t.is_ascii and not t.is_digit
              and len(t) > 1 and not any(char.isdigit() for char in t.text)]
    # remove empty lines, make it lower case and put them in sentence form
    if len(tokens):
        sentence = " ".join(token.text.lower() for token in tokens)
        # do something useful with sentence here
Upvotes: 1
Views: 994
Reputation: 15593
It looks like you just want to use the spaCy tokenizer? In that case use `nlp = spacy.blank("en")` instead of `spacy.load`, and then you can leave out the `disable` part in `nlp.pipe`.
Also to be clear, you're using spaCy v2?
Here's a function that makes your code faster and also cleaner:
def is_ok(tok):
    # this is much faster than `not in string.punctuation`
    if tok.is_punct: return False
    if tok.is_stop: return False
    if not tok.is_ascii: return False
    if tok.is_digit: return False
    if len(tok.text) < 2: return False
    # this gets rid of anything with a number in it
    if 'd' in tok.shape_: return False
    return True

# replace your stuff with this:
toks = [tok for tok in doc if is_ok(tok)]
Reading your zip file one line at a time should be totally fine since you're just using the tokenizer.
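For the line-at-a-time reading, here's a minimal stdlib-only sketch of what I mean (the generator name and the in-memory demo archive are just for illustration): wrap the zip member's byte stream in `io.TextIOWrapper` so lines are decoded incrementally instead of loading the whole file into memory.

```python
import io
import zipfile

def stream_lines(zip_source):
    """Yield decoded lines from the first member of a zip archive,
    without reading the whole file into memory."""
    with zipfile.ZipFile(zip_source) as thezip:
        with thezip.open(thezip.filelist[0], mode="r") as raw:
            # TextIOWrapper decodes the byte stream lazily, line by line
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                yield line.rstrip("\n")

# Demo: build a tiny zip in memory and stream it back out
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("demo.txt", "first line\nsecond line\n")
lines = list(stream_lines(buf))
print(lines)  # ['first line', 'second line']
```

Since `nlp.pipe` accepts any iterable, you can hand it `stream_lines(filename)` directly; spaCy batches internally, so peak memory is bounded by `batch_size` rather than by the file size.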
Upvotes: 1