Reputation: 984
When lemmatizing a Spanish CSV of more than 60,000 words, spaCy gets certain words wrong; I understand the model is not 100% accurate. However, I have not found any other solution, since NLTK does not ship a Spanish model.
A friend tried asking this question on the Spanish Stack Overflow, but that community is quite small compared with this one, and we got no answers.
Code:
import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

df['column'] = df['column'].apply(lambda x: lemmatizer(x))
I tried to lemmatize certain words that I found wrong to prove that SpaCy is not doing it correctly:
text = 'personas, ideas, cosas'
# translation: persons, ideas, things
print(lemmatizer(text))
# Current output:
personar , ideo , coser
# translation:
personify, ideo, sew
# The expected output should be:
persona, idea, cosa
# translation:
person, idea, thing
Upvotes: 10
Views: 13063
Reputation: 63
Maybe you can use FreeLing. Among many other functionalities, this library offers lemmatization in Spanish, Catalan, Basque, Italian and other languages.
In my experience, lemmatization in Spanish and Catalan is quite accurate, and although the library is written in C++, it has an API for Python and another for Java.
Upvotes: 3
Reputation: 164
You can use spacy-stanza. It wraps Stanza's models in spaCy's API:
import stanza
from spacy_stanza import StanzaLanguage

# stanza.download("es")  # run once to fetch the Spanish models
text = "personas, ideas, cosas"
snlp = stanza.Pipeline(lang="es")
nlp = StanzaLanguage(snlp)
doc = nlp(text)
for token in doc:
    print(token.lemma_)
Upvotes: 1
Reputation: 226
Unlike the English lemmatizer, spaCy's Spanish lemmatizer does not use PoS information at all. It relies on a lookup list of inflected forms and lemmas (e.g., ideo → idear, ideas → idear, idea → idear, ideamos → idear, etc.). It will just output the first match in the list, regardless of its PoS.
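To see why this produces errors like the ones in the question, here is a miniature sketch of how a PoS-blind lookup lemmatizer behaves (the table entries are illustrative, not spaCy's actual data):

```python
# Hypothetical flat lookup table: inflected form -> lemma, no PoS.
# "ideas" is both a verb form of "idear" and the plural noun "idea",
# but a flat table can only store one answer.
LOOKUP = {
    "personas": "personar",  # verb reading wins over the noun "persona"
    "ideas": "idear",        # verb reading wins over the noun "idea"
    "cosas": "coser",        # verb reading wins over the noun "cosa"
}

def lookup_lemmatize(word):
    # First (and only) match wins, regardless of part of speech.
    return LOOKUP.get(word, word)

print(lookup_lemmatize("ideas"))  # idear, even when "ideas" is a noun
print(lookup_lemmatize("perro"))  # perro (unknown words pass through)
```

A PoS-aware lemmatizer would instead key the table on (word, PoS) pairs, which is exactly what the rule-based approach described above does.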
I actually developed spaCy's new rule-based lemmatizer for Spanish, which takes PoS and morphological information (such as tense, gender, number) into account. These fine-grained rules make it a lot more accurate than the current lookup lemmatizer. It will be released soon!
Meanwhile, you can maybe use Stanford CoreNLP or FreeLing.
Upvotes: 21
Reputation: 2079
One option is to make your own lemmatizer.
This might sound frightening, but fear not! It is actually very simple to build one.
I've recently made a tutorial on how to make a lemmatizer, the link is here:
https://medium.com/analytics-vidhya/how-to-build-a-lemmatizer-7aeff7a1208c
As a summary, you'd have to:
- get a dictionary that maps each inflected form, together with its PoS, to its lemma;
- PoS-tag your text;
- look each (word, PoS) pair up in the dictionary, falling back to the word itself.
In code, it'd look like this:
def lemmatize(word, pos):
    if word in lemma_dict:
        if pos in lemma_dict[word]:
            return lemma_dict[word][pos]
    return word
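The snippet above can be made self-contained with a toy dictionary; the entries below are illustrative placeholders, while a real dictionary would be built from a lexical resource such as a Universal Dependencies treebank or Wiktionary:

```python
# Hypothetical toy dictionary: word -> PoS -> lemma.
lemma_dict = {
    "personas": {"NOUN": "persona", "VERB": "personar"},
    "ideas": {"NOUN": "idea", "VERB": "idear"},
    "cosas": {"NOUN": "cosa", "VERB": "coser"},
}

def lemmatize(word, pos):
    # Only return a lemma when both the word and its PoS are known;
    # otherwise fall back to the surface form unchanged.
    if word in lemma_dict and pos in lemma_dict[word]:
        return lemma_dict[word][pos]
    return word

print(lemmatize("ideas", "NOUN"))  # idea
print(lemmatize("cosas", "VERB"))  # coser
print(lemmatize("gatos", "NOUN"))  # gatos (not in the dictionary)
```

Because the dictionary is keyed on (word, PoS), the noun reading of "ideas" no longer collides with the verb reading, which is precisely the ambiguity that trips up spaCy's lookup lemmatizer.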
Simple, right?
In fact, simple lemmatization doesn't require as much processing as one would think. The hard part lies in PoS tagging, but you get that for free. Either way, if you want to do the tagging yourself, you can see this other tutorial I made:
https://medium.com/analytics-vidhya/part-of-speech-tagging-what-when-why-and-how-9d250e634df6
Hope you get it solved.
Upvotes: 3