Reputation: 619
I am using spaCy for NLP in Python. I am trying to use nlp.pipe() to generate a list of spaCy Doc objects, which I can then analyze. Oddly enough, nlp.pipe() returns a generator object, <generator object pipe at 0x7f28640fefa0>. How can I get it to return a list of docs, as intended?
import spacy
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser'])
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)
docs
Upvotes: 9
Views: 13872
Reputation: 808
nlp.pipe returns a generator on purpose! Generators are awesome. They are more memory-friendly in that they let you iterate over a series of objects, but unlike a list, they only evaluate the next object when they need to, rather than all at once.
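To make the lazy evaluation concrete, here is a minimal pure-Python sketch. It does not use spaCy at all: `fake_pipe` is a hypothetical stand-in for nlp.pipe, and the `processed` list exists only to record when each text is actually handled.

```python
processed = []  # records which texts have actually been processed so far


def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: yields one result at a time."""
    for text in texts:
        processed.append(text)  # the work happens only when an item is requested
        yield text.upper()      # placeholder for building an expensive Doc


docs = fake_pipe(["one", "two", "three"])
print(processed)         # [] (creating the generator does no work yet)

first = next(docs)       # ask for exactly one item
print(first, processed)  # ONE ['one']

rest = list(docs)        # drain whatever is left
print(rest)              # ['TWO', 'THREE']
```

Only one item is in flight at a time until something (like `list`) asks for all of them.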
spaCy is going to turn those strings into Doc objects, which are honkin' big C structs. If your corpus is big enough, storing it all in one variable (e.g., docs = [nlp(text) for text in matches] or docs = list(nlp.pipe(matches))) will be inefficient or even impossible. If you're training on any significant amount of data, this won't be a great idea.
Even if it isn't literally impossible, you can do cool things faster if you use the generator as part of a pipeline instead of just dumping it into a list. If you want to extract only certain information, for example, to create a database column of just the named entities, or just the place names in your data, you wouldn't need to store the whole thing in a list and then do a nested for-loop to get them out.
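As a rough sketch of that idea, the helper below streams entities out of nlp.pipe one Doc at a time instead of building a list of Docs first. The name `iter_entities` and its `labels` parameter are made up for illustration; it only assumes that `nlp` is a loaded pipeline with an NER component, and that labels such as "GPE" and "LOC" exist in your model (they do in the standard English models, but check your own).

```python
def iter_entities(nlp, texts, labels=None):
    """Yield (entity_text, entity_label) pairs, processing one Doc at a time.

    `nlp` is assumed to be a loaded spaCy pipeline with an NER component.
    `labels` is an optional set of entity labels to keep (None keeps all).
    """
    for doc in nlp.pipe(texts):      # Docs are produced lazily
        for ent in doc.ents:         # entities found in this Doc
            if labels is None or ent.label_ in labels:
                yield ent.text, ent.label_
        # `doc` can be garbage-collected once we move on to the next one


# Usage, assuming the model is installed:
#   import spacy
#   nlp = spacy.load("en_core_web_md")
#   place_names = [text for text, _ in
#                  iter_entities(nlp, my_texts, labels={"GPE", "LOC"})]
```

The resulting list holds only short strings, which is far smaller than a list of full Doc objects.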
Moreover, several Doc attributes, such as Doc.sents and Doc.noun_chunks, are generators too. Similar kinds of data types show up in gensim as well -- half the challenge of NLP is figuring out how to do this stuff in ways that will scale, so it's worth getting used to more efficient containers. (Plus, you can do cooler things with them!)
The official spaCy starter has some notes on scaling and performance in Chapter 3.
Upvotes: 8
Reputation: 1434
To iterate through the docs, just do
for item in docs:
    ...  # work with each Doc here
or do
list_of_docs = list(docs)
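One thing to watch for with either approach: a generator can be consumed only once. The sketch below shows this with a hypothetical stand-in for nlp.pipe (`stand_in_pipe` is not spaCy code), so it runs without any model installed.

```python
def stand_in_pipe(texts):
    """Hypothetical stand-in for nlp.pipe that yields placeholder docs."""
    for text in texts:
        yield f"Doc({text!r})"


docs = stand_in_pipe(["one", "two", "three"])

list_of_docs = list(docs)   # consumes the generator and keeps every item
print(len(list_of_docs))    # 3

print(list(docs))           # [] (the generator is now exhausted)

# The list, unlike the generator, can be looped over as often as needed.
for item in list_of_docs:
    print(item)
```

So if you need several passes over the same docs, build the list once and reuse it; if one pass is enough, the plain for loop avoids holding everything in memory.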
Upvotes: 13