Reputation: 1215
Simple question: When do we stem or lemmatize words? Is stemming helpful for all NLP tasks, or are there applications where using the full form of words might result in better accuracy or precision?
Upvotes: 8
Views: 6131
Reputation: 575
Stemming is very useful for various tasks. If you are doing document similarity, for example, it's far better to normalize the data: remove the genitive and stop words, lowercase everything, strip punctuation, and uninflect (reduce each word to its base form). Another suggestion is to sort the words. That isn't so bad with bigrams but might look odd with much longer terms.
Stack Exchange's
stack exchange
STACK EXCHANGE
Exchange, Stack
Stack Exchange (WEB)
StAcK Exchanges
All of those should normalize to "exchange stack" for purposes of further computation.
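The normalization steps above can be sketched as follows. This is a minimal illustration, not a production pipeline: the stop-word list, the `naive_singular` helper, and the rule that drops parenthesized qualifiers such as "(WEB)" are all assumptions made for this example, and a real system would use a proper stemmer or lemmatizer for the uninflection step.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of"}


def naive_singular(word: str) -> str:
    """Crude stand-in for real uninflection: drop one trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(term: str) -> str:
    text = term.lower()                        # lowercase everything
    text = re.sub(r"\([^)]*\)", " ", text)     # assumption: drop qualifiers like "(WEB)"
    text = re.sub(r"'s\b", "", text)           # remove the genitive
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip remaining punctuation
    words = [naive_singular(w) for w in text.split() if w not in STOP_WORDS]
    return " ".join(sorted(words))             # sort the words


variants = [
    "Stack Exchange's",
    "stack exchange",
    "STACK EXCHANGE",
    "Exchange, Stack",
    "Stack Exchange (WEB)",
    "StAcK Exchanges",
]
normalized = {normalize(v) for v in variants}
print(normalized)  # {'exchange stack'}
```

Because every variant collapses to the same key, a similarity or deduplication step can compare the keys directly.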
Upvotes: 2
Reputation: 15422
In the context of machine-learning-based NLP, stemming makes your training data denser. It reduces the size of the dictionary (the number of distinct words used in the corpus) two- or three-fold (or even more for highly inflected languages like French, where a single stem can generate dozens of word forms, for instance in the case of verbs).
Given the same corpus but fewer input dimensions, ML will generally work better. Recall in particular should improve.
The downside is, if in some cases the actual word (as opposed to its stem) makes a difference, then your system won't be able to leverage it. So you might lose some precision.
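To make the trade-off concrete, here is a small sketch that counts distinct words before and after stemming. The `crude_stem` function is a hypothetical suffix stripper written for this example, not a real algorithm such as Porter's, and the token list is made up.

```python
def crude_stem(word: str) -> str:
    """Toy suffix stripper (assumption: stands in for a real stemmer)."""
    if word.endswith(("xes", "ches", "shes")):
        return word[:-2]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = (
    "index indexes indexed indexing "
    "walk walks walked walking "
    "meet meeting"
).split()

vocabulary = set(tokens)
stemmed_vocabulary = {crude_stem(t) for t in tokens}

print(len(vocabulary), len(stemmed_vocabulary))  # 10 3

# The downside: the noun "meeting" and the verb "meet" are now the same feature.
print(crude_stem("meeting") == crude_stem("meet"))  # True
```

Ten input dimensions shrink to three, so each remaining feature is seen more often in training. The last line shows the cost: a model can no longer tell "meeting" from "meet" if that distinction matters for the task.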
Upvotes: 11
Reputation: 65599
When do we stem or lemmatize the words?
Stemming is a useful "normalization" technique for words. Consider as an example searching over a corpus of documents. More specifically, we might prepare a bunch of documents to be searchable in some kind of search index. When creating the search index we take similar terms and stem them to a root word so that searches on other forms of the word match our document.
Consider, for example, several terms that are all forms of the word "index". Let's say we convert each of these to the term "index" in our search index. Whenever we encounter one of these, we'll use the root form "index" instead of the word present in the document.
Similarly, we perform the same step before running a search query, such as "database indexing". The query will be transformed to "database index", matching all the documents that have any form of "index" in them, most likely increasing the relevance of our search results.
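The index-time and query-time steps can be sketched like this. It is a toy inverted index under stated assumptions: `crude_stem` is a hypothetical suffix stripper rather than a real stemmer, the documents are invented, and a query matches a document only when every stemmed query term occurs in it.

```python
import re
from collections import defaultdict


def crude_stem(word: str) -> str:
    """Toy suffix stripper (assumption: stands in for a real stemmer)."""
    if word.endswith(("xes", "ches", "shes")):
        return word[:-2]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> list:
    """Lowercase, split into words, and stem. Used for documents AND queries."""
    return [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]


def build_index(documents: dict) -> dict:
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    terms = analyze(query)
    if not terms:
        return set()
    # Keep only documents that contain every stemmed query term.
    return set.intersection(*(index.get(t, set()) for t in terms))


documents = {
    1: "Doug likes indexing databases",
    2: "The database index was rebuilt",
    3: "Indexes make lookups faster",
    4: "Cooking pasta at home",
}
index = build_index(documents)

print(analyze("database indexing"))               # ['database', 'index']
print(sorted(search(index, "database indexing")))  # [1, 2]
print(sorted(search(index, "indexes")))            # [1, 2, 3]
```

The key design point is that the same `analyze` function runs on both sides. If documents were stemmed but queries were not (or the reverse), "indexing" in a query would never meet "index" in the index.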
In full-text search, keeping the original, unstemmed words is useful when performing a phrase search, where we might spell out a grammatically correct phrase, such as the exact phrase "Doug likes indexing databases". We would want the full "indexing" in full-text search in that context.
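A minimal sketch of why exact-phrase matching needs the original word forms: the check below compares unstemmed tokens in order. The function names and sample documents are made up for this example.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase and split into words, with no stemming."""
    return re.findall(r"[a-z]+", text.lower())


def contains_phrase(document: str, phrase: str) -> bool:
    """True if the phrase's words appear consecutively, in their exact forms."""
    doc_tokens = tokenize(document)
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    return any(
        doc_tokens[i : i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


phrase = "Doug likes indexing databases"

print(contains_phrase("Every morning Doug likes indexing databases.", phrase))  # True
print(contains_phrase("Doug liked the indexed database.", phrase))              # False
```

A stemmed index would treat both documents as matches for the words in the phrase, so a search engine that supports exact phrases typically has to keep the unstemmed tokens (and their positions) alongside the stemmed ones.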
Upvotes: 3