Reputation: 4521
I have a spaCy doc that I would like to lemmatize.
For example:
import spacy
nlp = spacy.load('en_core_web_lg')
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
How can I convert every token in the doc to its lemma?
Upvotes: 6
Views: 10733
Reputation: 53
This answer covers the case where your text consists of multiple sentences.
If you want to obtain a list of the lemma of every token, do:
import spacy
# the bare 'en' shortcut is deprecated in newer spaCy versions; load a named model instead
nlp = spacy.load('en_core_web_sm')
my_str = 'Python is the greatest language in the world. A python is an animal.'
doc = nlp(my_str)
words_lemmata_list = [token.lemma_ for token in doc]
print(words_lemmata_list)
# Output:
# ['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world', '.',
# 'a', 'python', 'be', 'an', 'animal', '.']
If you want a list of the sentences instead, with every token in each sentence lemmatized, do:
sentences_lemmata_list = [sentence.lemma_ for sentence in doc.sents]
print(sentences_lemmata_list)
# Output:
# ['Python be the great language in the world .', 'a python be an animal .']
Upvotes: 1
Reputation: 1446
If you don’t need a particular component of the pipeline (for example, the NER or the parser), you can disable loading it. This can sometimes make a big difference and improve loading and processing speed.
For your case (lemmatizing a doc with spaCy) you only need the tagger component (note that in spaCy v3 the models ship a separate lemmatizer component, which should also be kept enabled).
So here is a sample code:
import spacy
# keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_lg', disable=["parser", "ner"])
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
words_lemmas_list = [token.lemma_ for token in doc]
print(words_lemmas_list)
Output:
['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world']
Upvotes: 6
Reputation: 456
Each token has a number of attributes; you can iterate through the doc to access them.
For example: [token.lemma_ for token in doc]
If you want to reconstruct the sentence you could use: ' '.join([token.lemma_ for token in doc])
For a full list of token attributes see: https://spacy.io/api/token#attributes
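The reconstruction step itself is independent of spaCy; a minimal sketch with a hard-coded lemma list (standing in for `[token.lemma_ for token in doc]`, so it runs without downloading a model):

```python
# Stand-in for [token.lemma_ for token in doc]; the lemmas are hard-coded
# here so the sketch runs without a spaCy model installed.
lemmas = ['python', 'be', 'the', 'great', 'language', 'in', 'the', 'world']

# Reconstructing the sentence is a plain string join over the lemmas.
lemmatized = ' '.join(lemmas)
print(lemmatized)  # python be the great language in the world
```

Note that a plain `' '.join` also puts a space before punctuation tokens; if you want to preserve the original spacing, `''.join(token.lemma_ + token.whitespace_ for token in doc)` uses each token's trailing whitespace instead.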
Upvotes: 7