Reputation: 2097
Apparently for doc in nlp.pipe(sequence) is much faster than running for el in sequence: doc = nlp(el).
The problem I have is that my sequence is really a sequence of tuples: each tuple contains the text for spaCy to convert into a document, plus additional information which I would like to set as document attributes (which I would register on Doc).
I am not sure how to modify a spaCy pipeline so that the first stage picks one item from the tuple to run the tokenizer on and produce the document, and a later stage uses the remaining items from the tuple to add those features to the existing document.
Upvotes: 2
Views: 1578
Reputation: 1580
A bit late, but in case someone comes looking for this in 2022:
There is no official/documented way to access the context (the second element of each tuple) for the Doc
object from within a pipeline component. However, the context does get written to an internal doc._context
attribute, so we can use this internal attribute to access the context from within our pipeline components.
For example:
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

data = [
    ("stackoverflow is great", {"attr1": "foo", "attr2": "bar"}),
    ("The sun is shining today", {"location": "Hawaii"})
]

# Set up your custom pipeline component. You can access the doc's
# context, such as {"attr1": "foo", "attr2": "bar"}, from within it.
@Language.component("my_pipeline")
def my_pipeline(doc):
    print(doc._context)
    return doc

# Add the component to the pipeline
nlp.add_pipe("my_pipeline")

# Process the data and do something with the doc and/or context
for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc)
    print(context)
If you are interested in the source code, see the nlp.pipe method and the internal nlp._ensure_doc_with_context method here: https://github.com/explosion/spaCy/blob/6b83fee58db27cee70ef8d893cbbf7470db4e242/spacy/language.py#L1535
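Building on this, here is a minimal runnable sketch that copies the context into registered custom attributes from inside a component. The component and attribute names ("context_to_attrs", "attr1", "attr2") are hypothetical, a blank pipeline is used so no trained model download is needed, and keep in mind that doc._context is internal and undocumented, so it may change between spaCy versions:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the custom attributes once (hypothetical names for illustration)
for name in ("attr1", "attr2"):
    if not Doc.has_extension(name):
        Doc.set_extension(name, default=None)

@Language.component("context_to_attrs")
def context_to_attrs(doc):
    # doc._context is an internal attribute holding the second tuple element;
    # guard with getattr in case a spaCy version does not set it
    for key, value in (getattr(doc, "_context", None) or {}).items():
        if doc.has_extension(key):
            doc._.set(key, value)
    return doc

nlp = spacy.blank("en")  # blank pipeline; no trained model needed for this demo
nlp.add_pipe("context_to_attrs")

data = [("stackoverflow is great", {"attr1": "foo", "attr2": "bar"})]
results = list(nlp.pipe(data, as_tuples=True))
doc, context = results[0]
print(doc._.attr1, doc._.attr2)
```

This keeps the attribute assignment inside the pipeline itself, which is what the question asked for, at the cost of relying on an internal attribute.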
Upvotes: 0
Reputation: 7105
It sounds like you might be looking for the as_tuples argument of nlp.pipe? If you set as_tuples=True, you can pass in a stream of (text, context) tuples and spaCy will yield (doc, context) tuples (instead of just Doc objects). You can then use the context and add it to custom attributes etc.
Here's an example:
data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    # Let's assume you have a "meta" extension registered on the Doc
    doc._.meta = context["meta"]
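For completeness, the "meta" extension the comment assumes can be registered with Doc.set_extension. A minimal runnable sketch, using a blank English pipeline so no trained model download is required:

```python
import spacy
from spacy.tokens import Doc

# Register the custom "meta" attribute once, e.g. at startup
if not Doc.has_extension("meta"):
    Doc.set_extension("meta", default=None)

nlp = spacy.blank("en")  # blank pipeline; a trained model is not required here
data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"})
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.meta = context["meta"]
    docs.append(doc)
```

Registering the extension up front gives every Doc a default value, so reading doc._.meta is safe even for documents whose context lacked a "meta" key.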
Upvotes: 5