user8871463

Reputation:

Get the word from stem (stemming)

I am using the Porter stemmer as follows to get the stems of my words.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

Now I want to know whether it is possible to recover a readable word from a stem, for example environ to environment or educ to education. Is this possible?

Upvotes: 2

Views: 1374

Answers (2)

alvas

Reputation: 122092

As @MikeDinescu explained, stemming is lossy and "un-stemming" is not that simple.

But given that you have a fixed vocabulary and a list of stems, you can compare each stem against all entries in your vocabulary and calculate some sort of string distance.

Here's an example https://gist.github.com/alvations/a4a6e0cc24d2fd9aff86
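As a rough, standard-library-only sketch of the same idea (this is not the code from the gist), you could use the similarity ratio from `difflib` as the string distance. The function name `closest_words`, the sample vocabulary and the cutoff value are assumptions made for illustration.

```python
import difflib

def closest_words(stem, vocabulary, n=3, cutoff=0.6):
    """Return up to n vocabulary words most similar to the stem, best first."""
    # get_close_matches ranks candidates by SequenceMatcher.ratio()
    # and drops anything scoring below the cutoff.
    return difflib.get_close_matches(stem, vocabulary, n=n, cutoff=cutoff)

# Hypothetical vocabulary, for illustration only.
vocab = ["environment", "education", "educate", "happy", "happiness"]

for s in ["environ", "educ", "happi"]:
    print(s, "->", closest_words(s, vocab))
```

The cutoff probably needs tuning for your data: short stems share few characters with long words, so a strict cutoff can filter out the word you actually want.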

Upvotes: 0

Mike Dinescu

Reputation: 55730

So you want to take a stem and map it to the list of words in a dictionary that stem back to it?

This is difficult because the stemming process is lossy and because it's not a 1:1 transformation.

That said, in some cases like environ -> {environment, environments, environmental} and educ -> {educate, educational, education, educated, educating} you can get by with a trie structure where you do a prefix lookup. Things get more interesting for stems like happi, which has to map back to happy even though happi is not a prefix of happy.
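A minimal sketch of the prefix lookup, assuming a plain nested-dict trie and a small made-up vocabulary (`build_trie`, `words_with_prefix` and the word list are illustrative names, not from any library):

```python
END = ""  # end-of-word marker; cannot clash with a single-character key

def build_trie(words):
    """Build a nested-dict trie; each complete word is stored under END."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def words_with_prefix(trie, prefix):
    """Return all stored words that start with the given prefix, sorted."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Walk the subtree below the prefix and collect every complete word.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

# Hypothetical vocabulary, for illustration only.
vocab = ["environment", "environments", "environmental",
         "educate", "education", "happy", "happiness"]
trie = build_trie(vocab)

print(words_with_prefix(trie, "environ"))
print(words_with_prefix(trie, "happi"))  # finds "happiness" but not "happy"
```

The second lookup shows the limitation: a prefix search can never return happy for happi.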

In the general case, you would have to start with a dictionary and then produce an inverted index by stemming each word and mapping the stem back to the source word in the index. Using the inverted index you can then look up matches given a stem.
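Roughly, the inverted index could look like the sketch below. The names `build_inverted_index` and `lookup` are made up for this example, and `toy_stem` is only a stand-in lookup table so the snippet runs without NLTK; in practice you would pass `PorterStemmer().stem` as the `stem` argument.

```python
from collections import defaultdict

def build_inverted_index(words, stem):
    """Map each stem to the set of dictionary words that produce it."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return index

def lookup(index, stem_value):
    """Return the words recorded for a stem, sorted; empty list if unknown."""
    return sorted(index.get(stem_value, ()))

# Stand-in stemmer for the demo (an assumption, not a real stemmer).
# With NLTK you would use: stem = PorterStemmer().stem
_TOY_STEMS = {
    "environment": "environ", "environments": "environ",
    "education": "educ", "educated": "educ",
    "happy": "happi",
}

def toy_stem(word):
    return _TOY_STEMS.get(word, word)

index = build_inverted_index(_TOY_STEMS.keys(), toy_stem)

print(lookup(index, "environ"))
print(lookup(index, "happi"))
```

Because the index is built by running the stemmer itself, it handles cases like happi -> happy that a prefix search misses. It only knows about words that were in the dictionary you started with.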

Hope this helps..

Upvotes: 3
