Wiliam

Reputation: 1088

Finding the original form of a word after stemming

I am stemming a list of words and building a dataframe from it. The original data is as follows:

import pandas as pd

original = 'The man who flies the airplane dies in an air crash. His wife died a couple of weeks ago.'
df = pd.DataFrame({'text':[original]})

The functions I've used for lemmatisation and stemming are:

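import gensim
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# Assumption: the question never shows how `stemmer` is defined; a Snowball
# stemmer is one plausible choice that fits the output shown below.
stemmer = SnowballStemmer('english')

# Lemmatize the token as a verb, then stem the lemma.
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords, and lemmatize/stem what is left.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result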

Running df['text'].map(preprocess)[0] gives:

['man',
 'fli',
 'airplan',
 'die',
 'air',
 'crash',
 'wife',
 'die',
 'coupl',
 'week',
 'ago']

How can I map the output back to the original tokens? For instance, die comes from both died and dies.

Upvotes: 1

Views: 782

Answers (2)

gojomo

Reputation: 54183

Stemming destroys information in the original corpus, by non-reversibly turning multiple tokens into some shared 'stem' form.

If you want the original text, you need to retain it yourself.
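One way to retain it is to keep each original token next to its stem, rather than storing only the stems. The sketch below is illustrative, not a definitive implementation: `preprocess_with_originals` and `toy_stem` are hypothetical names, and `toy_stem` is a small lookup table standing in for the question's `lemmatize_stemming` so the example runs without gensim or NLTK.

```python
# Hypothetical stand-in for the question's lemmatize_stemming():
# a tiny lookup table, NOT a real stemmer.
TOY_STEMS = {"flies": "fli", "dies": "die", "died": "die"}

def toy_stem(token):
    # Tokens missing from the table are returned unchanged.
    return TOY_STEMS.get(token, token)

def preprocess_with_originals(tokens, stem_fn):
    """Return (original, stem) pairs, in order, so no token is lost."""
    return [(token, stem_fn(token)) for token in tokens]

pairs = preprocess_with_originals(["man", "flies", "dies", "died"], toy_stem)

# The stems feed the downstream model; the originals stay aligned by position.
stems = [stem for _, stem in pairs]
originals = [token for token, _ in pairs]
```

Because the two lists share positions, `originals[i]` is always the token that produced `stems[i]`.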

But also, note: many algorithms working on large amounts of data, like word2vec under ideal conditions, don't necessarily need or even benefit from stemming. You want to have vectors for all the words in the original text – not just the stems – and with enough data, the related forms of a word will get similar vectors. (Indeed, they'll even differ in useful ways, with all 'past' or 'adverbial' or whatever variants sharing a similar directional skew.)

So only stem if you're sure it helps your goals, given the limits of your corpus.

Upvotes: 1

jadore801120

Reputation: 77

You could return the mapping relationship as the result and perform postprocessing later.

def preprocess(text):
    # Use a dict (not a list) so each original token can key its lemma.
    lemma_mapping = {}
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma_mapping[token] = lemmatize_stemming(token)
    return lemma_mapping

Or store it as a by-product.

from collections import defaultdict

lemma_mapping = defaultdict(str)
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma = lemmatize_stemming(token)
            result.append(lemma)
            lemma_mapping[token] = lemma
    return result
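Either mapping goes from token to lemma, while the question asks for the opposite direction. A minimal sketch of inverting it follows; `invert_mapping` is a hypothetical helper, and the sample dictionary is hand-written from the question's output rather than produced by running the stemmer.

```python
from collections import defaultdict

def invert_mapping(lemma_mapping):
    """Turn {token: lemma} into {lemma: set of original tokens}."""
    originals_by_lemma = defaultdict(set)
    for token, lemma in lemma_mapping.items():
        originals_by_lemma[lemma].add(token)
    return originals_by_lemma

# Hand-written sample in the shape preprocess() would fill in.
lemma_mapping = {"dies": "die", "died": "die", "flies": "fli", "man": "man"}
originals_by_lemma = invert_mapping(lemma_mapping)

print(sorted(originals_by_lemma["die"]))  # ['died', 'dies']
```

A lemma can map to several tokens, so the lookup returns a set of candidates, not a single "original" word.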

Upvotes: 0
