Reputation: 49
I am normalizing tens of thousands of docs with spaCy 3.
nlp = spacy.load('en_core_web_sm')
docs = nlp.tokenizer.pipe(doc_list)
return [[word.lemma_ for word in doc if word.is_punct == False and word.is_stop == False] for doc in docs]
but every lemma_ that comes back is an empty string.
If I run the full pipeline on each doc instead, the lemmas are filled in, but it is slow for this many documents:
a = [[word.lemma_ for word in nlp(doc) if word.is_punct == False and word.is_stop == False] for doc in doc_list]
How can I do this properly?
Upvotes: 2
Views: 937
Reputation: 1499
The difference is in how you are creating the docs.
nlp.tokenizer.pipe() - this will only run the tokenizer on all your docs, not the lemmatizer. So all you get is your docs split into tokens; the lemma_ attribute is never set.

nlp(doc) - this will run all the default components, which are ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']. Since the lemmatizer is part of the pipeline, the lemma_ attribute is set. But it is slower, because you are running all the components, even the ones you don't need.

What you should be doing:
import spacy
# Exclude components not required when loading the spaCy model.
nlp = spacy.load("en_core_web_sm", exclude=["tok2vec", "parser", "ner", "attrbute_ruler"])
# Extract lemmas as required.
a = [[word.lemma_ for word in nlp(doc) if word.is_punct == False and word.is_stop == False] for doc in doc_list]
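Since the question mentions tens of thousands of docs, batching the texts with nlp.pipe() instead of calling nlp(doc) once per document usually speeds things up further. A minimal sketch of that idea, assuming doc_list is the list of raw strings from the question; the batch_size value and the choice to exclude only parser and ner are illustrative assumptions, not taken from the answer above:
import spacy

# Keep the tagger, attribute_ruler and lemmatizer so lemma_ is filled in;
# parser and ner are not needed for lemmatization (a conservative exclusion list).
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# Sanity check: see which components are actually loaded.
print(nlp.pipe_names)

# nlp.pipe() streams the texts through the pipeline in batches,
# which is typically much faster than a Python loop over nlp(doc).
a = [
    [word.lemma_ for word in doc if not word.is_punct and not word.is_stop]
    for doc in nlp.pipe(doc_list, batch_size=100)
]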
Upvotes: 2