max

Reputation: 4521

Lemmatize a doc with spaCy?

I have a spaCy doc that I would like to lemmatize.

For example:

import spacy
nlp = spacy.load('en_core_web_lg')

my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)

How can I convert every token in the doc to its lemma?

Upvotes: 6

Views: 10733

Answers (3)

Matthias S

Reputation: 53

This answer covers the case where your text consists of multiple sentences.

If you want a list with the lemma of every token, do:

import spacy
nlp = spacy.load('en')
my_str = 'Python is the greatest language in the world. A python is an animal.'
doc = nlp(my_str)

words_lemmata_list = [token.lemma_ for token in doc]
print(words_lemmata_list)
# Output: 
# ['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world', '.', 
# 'a', 'python', 'be', 'an', 'animal', '.']

If you want a list of the sentences with every token lemmatized, do:

sentences_lemmata_list = [sentence.lemma_ for sentence in doc.sents]
print(sentences_lemmata_list)
# Output:
# ['Python be the great language in the world .', 'a python be an animal .']
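
If you prefer to build the per-sentence strings yourself, joining the token lemmas of each sentence gives the same result; a small sketch of that approach:

sentences_lemmata_list = [' '.join(token.lemma_ for token in sentence)
                          for sentence in doc.sents]
print(sentences_lemmata_list)
# Expected output (same as above):
# ['Python be the great language in the world .', 'a python be an animal .']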

Upvotes: 1

Kyrylo Malakhov

Reputation: 1446

If you don't need a particular pipeline component, such as the NER or the parser, you can disable it when loading the model. This can sometimes make a big difference and improve loading speed.

For your case (lemmatizing a doc with spaCy) you only need the tagger component.

Here is some sample code:

import spacy

# keep only the tagger component, which is needed for lemmatization
nlp = spacy.load('en_core_web_lg', disable=["parser", "ner"])

my_str = 'Python is the greatest language in the world'

doc = nlp(my_str)
words_lemmas_list = [token.lemma_ for token in doc]
print(words_lemmas_list)

Output:

['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world']
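
If you have many texts to lemmatize, the same trimmed-down pipeline can be combined with nlp.pipe to process them in batches; a minimal sketch (the texts list here is just an illustration):

texts = ['Python is the greatest language in the world',
         'A python is an animal.']

# nlp.pipe streams the texts through the pipeline efficiently
for doc in nlp.pipe(texts):
    print([token.lemma_ for token in doc])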

Upvotes: 6

ame

Reputation: 456

Each token has a number of attributes; you can iterate through the doc to access them.

For example: [token.lemma_ for token in doc]

If you want to reconstruct the sentence you could use: ' '.join([token.lemma_ for token in doc])
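
Put together, a minimal runnable sketch (assuming the en_core_web_lg model from the question is installed):

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('Python is the greatest language in the world')

# one lemma per token
lemmas = [token.lemma_ for token in doc]

# rebuild the sentence from the lemmas
lemmatized_str = ' '.join(lemmas)
print(lemmatized_str)
# e.g. 'Python be the great language in the world'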

For a full list of token attributes see: https://spacy.io/api/token#attributes

Upvotes: 7
