Reputation: 2097
Apparently for doc in nlp.pipe(sequence) is much faster than running for el in sequence: doc = nlp(el).
The problem I have is that my sequence is really a sequence of tuples: each tuple contains the text for spaCy to convert into a document, plus additional information which I would like to set as document attributes (which I would register on Doc).
I am not sure how to modify a spaCy pipeline so that the first stage picks one item from the tuple to run the tokenizer on and produce the document, and a later stage uses the remaining items from the tuple to add those features to the existing document.
Upvotes: 2
Views: 1578
Reputation: 1580
A bit late, but in case someone comes looking for this in 2022:
There is no official/documented way to access the context (the second element of each tuple) for the Doc
object from within a pipeline component. However, the context does get written to an internal doc._context
attribute, so we can use this internal attribute to access the context from within our pipeline components.
For example:
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

data = [
    ("stackoverflow is great", {"attr1": "foo", "attr2": "bar"}),
    ("The sun is shining today", {"location": "Hawaii"})
]

# Set up your custom pipeline component. You can access the doc's
# context, such as {"attr1": "foo", "attr2": "bar"}, from within it.
@Language.component("my_pipeline")
def my_pipeline(doc):
    print(doc._context)
    return doc

# Add the component to the pipeline
nlp.add_pipe("my_pipeline")

# Process the data and do something with the doc and/or context
for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc)
    print(context)
If you are interested in the source code, see the nlp.pipe method and the internal nlp._ensure_doc_with_context method here: https://github.com/explosion/spaCy/blob/6b83fee58db27cee70ef8d893cbbf7470db4e242/spacy/language.py#L1535
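Building on this, here is a minimal runnable sketch that copies the context into registered custom attributes from inside a component. The component and attribute names ("context_to_attrs", "attr1", "attr2") are hypothetical, a blank pipeline is used so no trained model download is needed, and keep in mind that doc._context is internal and undocumented, so it may change between spaCy versions:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the custom attributes once (hypothetical names for illustration)
for name in ("attr1", "attr2"):
    if not Doc.has_extension(name):
        Doc.set_extension(name, default=None)

@Language.component("context_to_attrs")
def context_to_attrs(doc):
    # doc._context is an internal attribute holding the second tuple element;
    # guard with getattr in case a spaCy version does not set it
    for key, value in (getattr(doc, "_context", None) or {}).items():
        if doc.has_extension(key):
            doc._.set(key, value)
    return doc

nlp = spacy.blank("en")  # blank pipeline; no trained model needed for this demo
nlp.add_pipe("context_to_attrs")

data = [("stackoverflow is great", {"attr1": "foo", "attr2": "bar"})]
results = list(nlp.pipe(data, as_tuples=True))
doc, context = results[0]
print(doc._.attr1, doc._.attr2)
```

This keeps the attribute assignment inside the pipeline itself, which is what the question asked for, at the cost of relying on an internal attribute.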
Upvotes: 0
Reputation: 7105
It sounds like you might be looking for the as_tuples argument of nlp.pipe? If you set as_tuples=True, you can pass in a stream of (text, context) tuples and spaCy will yield (doc, context) tuples (instead of just Doc objects). You can then use the context and add it to custom attributes etc.
Here's an example:
data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    # Let's assume you have a "meta" extension registered on the Doc
    doc._.meta = context["meta"]
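For completeness, the "meta" extension the comment assumes can be registered with Doc.set_extension. A minimal runnable sketch, using a blank English pipeline so no trained model download is required:

```python
import spacy
from spacy.tokens import Doc

# Register the custom "meta" attribute once, e.g. at startup
if not Doc.has_extension("meta"):
    Doc.set_extension("meta", default=None)

nlp = spacy.blank("en")  # blank pipeline; a trained model is not required here
data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"})
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.meta = context["meta"]
    docs.append(doc)
```

Registering the extension up front gives every Doc a default value, so reading doc._.meta is safe even for documents whose context lacked a "meta" key.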
Upvotes: 5