Reputation: 873
I recently ran into behavior in spaCy that confused me: to process a single string I call nlp(string), but for a list of strings I have to use nlp.pipe(list_of_strings).
Here is an example:
import spacy

nlp = spacy.load('en_core_web_sm')  # or whichever model you use

string = 'this is a string to be processed by nlp'
doc = ['this', 'is', 'a', 'string', 'list', 'to', 'be', 'processed', 'by', 'spacy']

stringprocess = list(nlp(string))   # list of Token objects from one Doc
listprocess = list(nlp.pipe(doc))   # list of Doc objects, one per input string
listprocess
stringprocess
Why is this? I assume it has something to do with nlp.pipe() returning a generator.
What is the reason?
Thank you.
Upvotes: 3
Views: 739
Reputation: 1086
spaCy does this because generators are more memory efficient: a generator yields items one at a time and can be consumed only once, so it never has to hold all of its results in memory the way a list does.
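To illustrate the single-consumption property this relies on, here is a plain-Python sketch (not spaCy-specific) showing that a generator is exhausted after one pass:

```python
# A generator yields items lazily and can be consumed only once.
def squares(n):
    for i in range(n):
        yield i * i

gen = squares(4)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # already exhausted, so this is empty

print(first_pass)   # [0, 1, 4, 9]
print(second_pass)  # []
```

This is exactly what happens with the object returned by nlp.pipe: once you have iterated it (for example by wrapping it in list()), it is spent.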
According to the documentation, instead of processing texts one by one, nlp.pipe processes them in batches. You can also configure the batch size via the batch_size argument of nlp.pipe to tune performance for your system:

Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one. This is usually much more efficient.
If your goal is to process a large stream of data with nlp.pipe, it is much more efficient to write a streamer/generator that yields texts from your database or filesystem as they are needed, rather than loading everything into memory and processing it one item at a time.
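A minimal sketch of that streaming pattern (the in-memory list stands in for a file or database cursor; the spaCy calls in the comment are the standard nlp.pipe usage, assuming a model is installed):

```python
# Yield texts lazily from a source instead of loading them all at once.
def stream_texts(source):
    for text in source:
        text = text.strip()
        if text:          # skip blank lines
            yield text

texts = stream_texts(['first doc\n', '\n', 'second doc\n'])

# In real use you would feed the generator straight to nlp.pipe:
#   import spacy
#   nlp = spacy.load('en_core_web_sm')  # or whichever model you use
#   for doc in nlp.pipe(texts, batch_size=50):
#       ...  # Docs arrive as a stream; only one batch is in memory at a time

print(list(texts))  # ['first doc', 'second doc']
```

Because stream_texts is itself a generator, nothing is read or processed until nlp.pipe actually pulls the next batch from it.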
Upvotes: 3