Reputation: 720
Following spaCy's pipeline documentation, I have been trying to use the nlp.pipe
pattern to speed up my pipeline. What I have found, though, is that no matter what batch_size
I set, there is no speedup compared to a sequential run.
I was wondering whether the issue is on my end or whether batching simply doesn't work.
I am testing this behaviour on 30,000 texts that are on average 1,500 characters long, and I have tried batch sizes of 5, 50, 500, and 5000, to no avail.
So I timed:
for text in texts:
    doc = nlp(text)
vs.
doc_gen = nlp.pipe(texts, batch_size=batch_size, n_threads=n_threads)
with n_threads set to -1 and 2,
batch sizes of 5, 50, 500, and 5000,
and texts containing the 30,000 documents with an average length of 1,500 characters.
My timing results don't show any significant difference between the pipe pattern and the plain sequential loop.
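For reference, here is a minimal, self-contained version of the comparison (a sketch, not my actual benchmark: it uses spacy.blank("en"), a tokenizer-only pipeline that needs no model download, and toy texts in place of my real corpus; substitute a loaded model such as en_core_web_sm to time the full pipeline):

```python
import time
import spacy

# Tokenizer-only pipeline: no model download required.
# Replace with e.g. spacy.load("en_core_web_sm") to benchmark a full model.
nlp = spacy.blank("en")

# Toy stand-in for the real corpus (30,000 texts, ~1,500 chars each).
texts = ["This is a sample sentence used for benchmarking spaCy."] * 1000

# Sequential: one call per text.
start = time.perf_counter()
docs_seq = [nlp(text) for text in texts]
seq_time = time.perf_counter() - start

# Batched: nlp.pipe with an explicit batch_size.
start = time.perf_counter()
docs_pipe = list(nlp.pipe(texts, batch_size=50))
pipe_time = time.perf_counter() - start

print(f"sequential: {seq_time:.3f}s, pipe: {pipe_time:.3f}s")
```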
I am running Python 3 with spaCy 2.0.12.
Upvotes: 4
Views: 5203
Reputation: 399
The batch size is a parameter specific to nlp.pipe, and again, a good value depends on the data being worked on. For reasonably long-sized text such as news articles, it makes sense to keep the batch size reasonably small (so that each batch doesn't contain really long texts), so in this case 20 was chosen for the batch size. For other cases (e.g. Tweets) where each document is much shorter in length, a larger batch size can be used.
-Prashanth Rao, https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html#Option-1:-Sequentially-process-DataFrame-column
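To make the quoted advice concrete, here is a hedged sketch (the nlp object and the texts are placeholders of my own; spacy.blank("en") stands in for a loaded model, and the batch sizes just illustrate the small-for-long, large-for-short rule of thumb):

```python
import spacy

# Placeholder pipeline; in practice use your loaded model.
nlp = spacy.blank("en")

# Long documents (news-article-like): keep the batch size small,
# so no single batch is dominated by very long texts.
articles = ["word " * 300] * 100
article_docs = list(nlp.pipe(articles, batch_size=20))

# Short documents (tweet-like): a much larger batch size is fine.
tweets = ["a short tweet-length text"] * 100
tweet_docs = list(nlp.pipe(tweets, batch_size=500))
```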
In addition to the helpful quote above, the linked article describes three different ways to speed up text preprocessing with spaCy.
If using nlp.pipe is not speeding up the process, I'd recommend using pandas' .apply function instead, as shown below and on that website! I've seen it shorten a process from taking 9+ hours down to taking 47 minutes.
The page linked above provides the following code:
# Assumes `nlp` (a loaded spaCy model) and `stopwords` (a set of words)
# are already defined, as they are earlier in the linked article.
def lemmatize(text):
    """Perform lemmatization and stopword removal on the clean text.

    Returns a list of lemmas.
    """
    doc = nlp(text)
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha and tok.text.lower() not in stopwords]
    return lemma_list
The resulting lemmas are stored as a list in a separate column, preproc, as shown below.
%%time
df_preproc['preproc'] = df_preproc['clean'].apply(lemmatize)
df_preproc[['date', 'content', 'preproc']].head(3)
Upvotes: 4