SpaCy, apply extensions during pipe

Question

In SpaCy you can set extensions for documents like this:

from spacy.tokens import Doc

Doc.set_extension('chapter_id', default='')

doc = nlp('This is my text')
doc._.chapter_id = 'This is my ID'

However, I have thousands of text files to process, and SpaCy recommends using nlp.pipe for this:

docs = nlp.pipe(array_of_texts)

How can I set my extension values on each document when using nlp.pipe?

Ines Montani · Accepted Answer

You probably want to enable the as_tuples keyword argument on nlp.pipe, which lets you pass in a list of (text, context) tuples and yields (doc, context) tuples. So you could do something like this:

data = [('Some text', 1), ('Some other text', 2)]

def process_text(data):
    for doc, chapter_id in nlp.pipe(data, as_tuples=True):
        doc._.chapter_id = chapter_id
        yield doc
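
For reference, here is a self-contained sketch of the same idea that can be run as-is. It assumes a blank English pipeline (spacy.blank("en")) instead of a trained model, registers the extension with force=True so the script can be re-run in the same session, and uses made-up sample data; the function name process_texts is only illustrative.

```python
import spacy
from spacy.tokens import Doc

# Register the custom attribute once. force=True is an assumption made here
# so that re-running the script does not raise on the second registration.
Doc.set_extension("chapter_id", default="", force=True)

# Assumption: a blank English pipeline (tokenizer only) is enough to
# demonstrate the pattern; swap in your own loaded pipeline as needed.
nlp = spacy.blank("en")

# Hypothetical sample data: (text, context) pairs, where the context
# is the chapter ID that should end up on the Doc.
data = [("Some text", 1), ("Some other text", 2)]


def process_texts(pairs):
    """Yield Doc objects with doc._.chapter_id set from each pair's context."""
    for doc, chapter_id in nlp.pipe(pairs, as_tuples=True):
        doc._.chapter_id = chapter_id
        yield doc


docs = list(process_texts(data))
for doc in docs:
    print(doc._.chapter_id, doc.text)
```

Because process_texts is a generator, documents are still processed in batches by nlp.pipe and nothing is held in memory until you consume it; wrapping it in list() as above is only for the demonstration.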
