William Que

Reputation: 49

spaCy 3 - lemma_ returns an empty string

I am normalizing tens of thousands of docs using spaCy 3.

  1. To speed up the process, I tried this approach:
nlp = spacy.load('en_core_web_sm')
docs = nlp.tokenizer.pipe(doc_list)
return [[word.lemma_ for word in doc if not word.is_punct and not word.is_stop] for doc in docs]

but every lemma_ it returns is an empty string.

  2. So I use nlp(doc) directly, as follows, but it's too slow:
a = [[word.lemma_ for word in nlp(doc) if not word.is_punct and not word.is_stop] for doc in doc_list]

How can I do this properly?

Upvotes: 2

Views: 937

Answers (1)

Narayan Acharya

Reputation: 1499

The difference is in how you are creating the docs.

  1. In the first example you use nlp.tokenizer.pipe(), which runs only the tokenizer on your docs, not the lemmatizer. So all you get is your docs split into tokens; the lemma_ attribute is never set.
  2. In the second example you use nlp(doc), which runs all the default components (['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']). Since the lemmatizer is part of the pipeline, the lemma_ attribute is set. But it is slower, because you are running all the components, even the ones you don't need. You can confirm which components run with the snippet below.
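
A quick sanity check (the exact list depends on your model version):

import spacy

nlp = spacy.load("en_core_web_sm")

# Components that run on every nlp(doc) call, e.g.
# ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
print(nlp.pipe_names)

# The tokenizer is not in this list: nlp.tokenizer.pipe() bypasses all of
# these components, which is why lemma_ is never filled in.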

What you should be doing:

import spacy

# Exclude components not required when loading the spaCy model. Note that
# the rule-based lemmatizer in en_core_web_sm needs POS tags, so 'tok2vec',
# 'tagger' and 'attribute_ruler' must stay in the pipeline.
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# Extract lemmas as required.
a = [[word.lemma_ for word in nlp(doc) if not word.is_punct and not word.is_stop] for doc in doc_list]
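
Since you are processing tens of thousands of docs, it is also worth swapping the per-document nlp(doc) calls for nlp.pipe(), which streams texts through the pipeline in batches. A minimal sketch of the same extraction (doc_list stands in for your own texts, and batch_size is an arbitrary value to tune):

import spacy

nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

doc_list = ["The striped bats were hanging.", "She was running quickly."]

# nlp.pipe() processes texts in batches, which is typically much faster
# than calling nlp(doc) once per document.
a = [
    [word.lemma_ for word in doc if not word.is_punct and not word.is_stop]
    for doc in nlp.pipe(doc_list, batch_size=64)
]
print(a)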

Upvotes: 2
