Reputation: 49
I am normalizing tens of thousands of docs with spaCy 3.
nlp = spacy.load('en_core_web_sm')
docs = nlp.tokenizer.pipe(doc_list)
return [[word.lemma_ for word in doc if word.is_punct == False and word.is_stop == False] for doc in docs]
but every lemma_ that comes back is an empty string.
If I run the full pipeline on each doc instead, the lemmas are filled in, but it is slow for this many documents:
a = [[word.lemma_ for word in nlp(doc) if word.is_punct == False and word.is_stop == False] for doc in doc_list]
How can I do this properly?
Upvotes: 2
Views: 937
Reputation: 1499
The difference is in how you are creating the docs.
nlp.tokenizer.pipe() - this will only run the tokenizer on all your docs, not the lemmatizer. So all you get is your docs split into tokens; the lemma_ attribute is never set.

nlp(doc) - this will run all the default components, which are ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']. Since the lemmatizer is part of the pipeline, the lemma_ attribute is set. But it is slower, because you are running all the components, even the ones you don't need.

What you should be doing:
import spacy
# Exclude components not required when loading the spaCy model.
nlp = spacy.load("en_core_web_sm", exclude=["tok2vec", "parser", "ner", "attrbute_ruler"])
# Extract lemmas as required.
a = [[word.lemma_ for word in nlp(doc) if word.is_punct == False and word.is_stop == False] for doc in doc_list]
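Since the question mentions tens of thousands of docs, batching the texts with nlp.pipe() instead of calling nlp(doc) once per document usually speeds things up further. A minimal sketch of that idea, assuming doc_list is the list of raw strings from the question; the batch_size value and the choice to exclude only parser and ner are illustrative assumptions, not taken from the answer above:
import spacy

# Keep the tagger, attribute_ruler and lemmatizer so lemma_ is filled in;
# parser and ner are not needed for lemmatization (a conservative exclusion list).
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# Sanity check: see which components are actually loaded.
print(nlp.pipe_names)

# nlp.pipe() streams the texts through the pipeline in batches,
# which is typically much faster than a Python loop over nlp(doc).
a = [
    [word.lemma_ for word in doc if not word.is_punct and not word.is_stop]
    for doc in nlp.pipe(doc_list, batch_size=100)
]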
Upvotes: 2