Chris C

Reputation: 619

Spacy - nlp.pipe() returns generator

I am using Spacy for NLP in Python. I am trying to use nlp.pipe() to generate a list of Spacy doc objects, which I can then analyze. Oddly enough, nlp.pipe() returns an object of the class <generator object pipe at 0x7f28640fefa0>. How can I get it to return a list of docs, as intended?

import spacy

# Load the English model with the tagger and parser disabled
nlp = spacy.load('en_core_web_md', disable=['tagger', 'parser'])
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)  # this is a generator, not a list
docs

Upvotes: 9

Views: 13872

Answers (3)

Tom Wattley

Reputation: 519

You can just wrap the call in list():

docs = list(nlp.pipe(matches))

Upvotes: 0

Ray Johns

Reputation: 808

nlp.pipe returns a generator on purpose! Generators are awesome. They are more memory-friendly in that they let you iterate over a series of objects, but unlike a list, they only evaluate the next object when they need to, rather than all at once.
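To see that laziness without loading a model, here is a minimal sketch. Note that `fake_pipe` is a made-up stand-in for nlp.pipe (it is not part of spaCy), and the `upper()` call is only a placeholder for the real processing; the point is just when the work happens.

```python
def fake_pipe(texts, log):
    """Hypothetical stand-in for nlp.pipe: yields one processed item at a time."""
    for text in texts:
        log.append(text)      # record the moment each item is actually processed
        yield text.upper()    # placeholder for the real (expensive) processing

processed = []
docs = fake_pipe(["one", "two", "three"], processed)

before = list(processed)       # nothing processed yet: creating the generator does no work
first = next(docs)             # processes only the first item
after_first = list(processed)  # exactly one item has been touched so far
rest = list(docs)              # processes the remaining items on demand
```

Only one item is ever "in flight" unless you choose to collect the results yourself.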

spaCy is going to turn those strings into Doc objects, which are honkin' big C structs. If your corpus is big enough, storing it all in one variable (e.g., docs = [nlp(text) for text in matches] or docs = list(nlp.pipe(matches))) will be inefficient or even impossible. If you're training on any significant amount of data, this won't be a great idea.

Even if it isn't literally impossible, you can do cool things faster if you use the generator as part of a pipeline instead of just dumping it into a list. If you want to extract only certain information, for example, to create a database column of just the named entities, or just the place names in your data, you wouldn't need to store the whole thing in a list and then do a nested for-loop to get them out.
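As a rough sketch of that kind of pipeline, the helper below pulls out place names one doc at a time. The function name `iter_place_names` is my own invention, and I'm assuming an English model whose NER labels places as "GPE"; adjust the label for your model.

```python
def iter_place_names(nlp, texts):
    """Yield place-name strings one at a time, without keeping the Docs around.

    `nlp` is assumed to be a loaded spaCy pipeline with NER enabled.
    The "GPE" label is an assumption about the model's label scheme.
    """
    for doc in nlp.pipe(texts):        # docs are produced lazily
        for ent in doc.ents:           # named entities found in this doc
            if ent.label_ == "GPE":
                yield ent.text         # keep only the string, not the Doc
```

You would then write something like `for place in iter_place_names(nlp, matches): ...` and send each name straight to your database, so each Doc can be garbage-collected as soon as you are done with it.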

Moreover, the Doc.sents property (and many others) is a generator. Similar kinds of data types show up in gensim as well. Half the challenge of NLP is figuring out how to do this stuff in ways that will scale, so it's worth getting used to more efficient containers. (Plus, you can do cooler things with them!)

The official spaCy starter has some notes on scaling and performance in Chapter 3.

Upvotes: 8

Bayko

Reputation: 1434

For iterating through docs just do

for item in docs:
    print(item.text)

or do

list_of_docs = list(docs)
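One caveat worth knowing: a generator can only be consumed once, so pick one of the two options rather than doing both on the same object. A small sketch, using a plain generator expression as a stand-in for nlp.pipe(matches) so it runs without a model:

```python
# Stand-in for docs = nlp.pipe(matches); any generator behaves the same way
docs = (text.upper() for text in ["one", "two", "three"])

first_pass = [item for item in docs]   # consumes the generator
second_pass = list(docs)               # nothing left, so this is an empty list
```

If you need to go over the docs more than once, build the list first and iterate over that instead.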

Upvotes: 13
