BS100

Reputation: 873

spaCy: why does nlp() work for a single string while nlp.pipe() is needed for a list of strings?

I recently ran into some behavior in spaCy that puzzled me when processing strings:

for a single string object, I have to use nlp(string),

while for a list of strings, I have to use nlp.pipe(a list).

Here is an example.

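import spacy

# Assumption: the question does not say which pipeline was loaded; any installed one works here.
nlp = spacy.load('en_core_web_sm')

string = 'this is a string to be processed by nlp'
doc = ['this', 'is', 'a', 'string', 'list', 'to', 'be', 'processed', 'by', 'spacy']

stringprocess = list(nlp(string))   # list of Token objects from a single Doc
listprocess = list(nlp.pipe(doc))   # list of Doc objects, one per input string

listprocess
stringprocess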

Why is this? I assume it has something to do with nlp.pipe() being a generator.

What is the reason?

Thank you.

Upvotes: 3

Views: 739

Answers (1)

Anurag Wagh

Reputation: 1086

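spaCy does this because generators are more efficient. nlp() takes one text and returns one Doc, whereas nlp.pipe() takes an iterable of texts and yields Doc objects lazily. Because results are produced one at a time as you consume them, the full set of processed documents never has to be held in memory the way a list would be.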
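A minimal sketch of the two call styles may help here. It assumes a blank English pipeline (tokenizer only) created with spacy.blank("en") so that no model download is needed; the variable names are illustrative and not taken from the question.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no trained model is required.
nlp = spacy.blank("en")

text = "this is a string to be processed by nlp"
texts = ["this", "is", "a", "list", "of", "strings"]

# One string in, one Doc out. Iterating over a Doc yields Token objects.
doc = nlp(text)
tokens = [token.text for token in doc]

# An iterable of strings in, Doc objects yielded one by one.
# list() consumes the stream and collects every Doc.
docs = list(nlp.pipe(texts))

print(tokens)
print([d.text for d in docs])
```

So list(nlp(string)) in the question is a list of tokens from one document, while list(nlp.pipe(doc)) is a list of documents, one per input string.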

According to the spaCy documentation, nlp.pipe() processes texts in batches instead of applying the pipeline to them one by one.

Furthermore, you can set the batch size in nlp.pipe() with the batch_size argument to tune performance for your system.

Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one. This is usually much more efficient.
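As a rough illustration of the batching option, the sketch below passes batch_size to nlp.pipe(). The blank pipeline, the sample texts, and the value 4 are assumptions chosen for the example; a suitable batch size depends on your pipeline and hardware.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) keeps the example lightweight.
nlp = spacy.blank("en")

# Illustrative input, not from the original question.
texts = [f"document number {i}" for i in range(10)]

# batch_size controls how many texts are buffered together internally.
docs = nlp.pipe(texts, batch_size=4)

# Nothing is materialised as a list; Docs are pulled from the stream on demand.
first = next(docs)
remaining = list(docs)

print(first.text, len(remaining))
```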

If your goal is to process large streams of data with nlp.pipe(), it is much more efficient to write a streamer/generator that reads texts from a database or filesystem as they are needed than to load everything into memory first and then process the texts one by one.
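One way this could look, as a sketch only: stream_lines below is a hypothetical helper (not part of spaCy) that yields one text per non-empty line of a file, and a temporary file stands in for a real data source. The blank pipeline is again an assumption to avoid a model download.

```python
import os
import tempfile

import spacy


def stream_lines(path):
    """Hypothetical helper: yield one text per non-empty line, reading lazily."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


# Assumption: a blank English pipeline (tokenizer only).
nlp = spacy.blank("en")

# A small temporary file stands in for a large corpus on disk.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first text\nsecond text\n\nthird text\n")
    path = tmp.name

# The generator feeds nlp.pipe(), so the file is never loaded as a whole.
token_counts = []
for doc in nlp.pipe(stream_lines(path), batch_size=2):
    token_counts.append(len(doc))

os.remove(path)
print(token_counts)
```

The same pattern should carry over to a database cursor or any other iterable source, since nlp.pipe() only needs something it can iterate over.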

spacy pipe

Upvotes: 3
