Reputation: 1088
I am stemming a list of words and making a dataframe from it. The original data is as follows:
import pandas as pd

original = 'The man who flies the airplane dies in an air crash. His wife died a couple of weeks ago.'
df = pd.DataFrame({'text': [original]})
The functions I've used for lemmatisation and stemming are:
# Lemmatize, then stem. (The imports and stemmer definition are assumed here;
# the output below matches NLTK's English SnowballStemmer.)
import gensim
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result
The output comes from running df['text'].map(preprocess)[0],
for which I get:
['man',
'fli',
'airplan',
'die',
'air',
'crash',
'wife',
'die',
'coupl',
'week',
'ago']
I wonder how I can map the output back to the original tokens? For instance, I have die, which comes from both died and dies.
Upvotes: 1
Views: 782
Reputation: 54183
Stemming destroys information in the original corpus, by non-reversibly turning multiple tokens into some shared 'stem' form.
If you want the original text, you need to retain it yourself.
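For example, here is a minimal sketch that keeps each original token next to its stem (it reuses simple_preprocess and lemmatize_stemming from the question; the preprocess_with_originals name is just illustrative):

def preprocess_with_originals(text):
    pairs = []  # (original token, stemmed form)
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            pairs.append((token, lemmatize_stemming(token)))
    return pairs

# df['text'].map(preprocess_with_originals)[0] would give something like:
# [('man', 'man'), ('flies', 'fli'), ..., ('died', 'die'), ('couple', 'coupl'), ...]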
But also, note: many algorithms working on large amounts of data, like word2vec under ideal conditions, don't necessarily need or even benefit from stemming. You want to have vectors for all the words in the original text – not just the stems – and with enough data, the related forms of a word will get similar vectors. (Indeed, they'll even differ in useful ways, with all 'past' or 'adverbial' or whatever variants sharing a similar directional skew.)
So only stem if you're sure it's helping, given your corpus limits and your goals.
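For instance, a rough sketch of training on the unstemmed tokens with gensim's Word2Vec (the toy corpus and hyperparameters here are only illustrative):

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Tokenize without stemming; each document becomes a list of surface forms.
sentences = [simple_preprocess(doc) for doc in df['text']]

# With a realistically sized corpus, 'die', 'dies' and 'died' would each get
# their own vector, and those vectors would end up close to one another.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=10)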
Upvotes: 1
Reputation: 77
You could return the mapping as the result and do the postprocessing later.
def preprocess(text):
    lemma_mapping = {}  # maps each original token to its stemmed form
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma_mapping[token] = lemmatize_stemming(token)
    return lemma_mapping
Or store it as a by-product.
from collections import defaultdict

lemma_mapping = defaultdict(str)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma = lemmatize_stemming(token)
            result.append(lemma)
            lemma_mapping[token] = lemma
    return result
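As the postprocessing step, you can then invert lemma_mapping to get from each stem back to the original tokens that produced it (a sketch, assuming preprocess has already been run over the text; the stem_to_tokens name is just illustrative):

from collections import defaultdict

stem_to_tokens = defaultdict(set)
for token, stem in lemma_mapping.items():
    stem_to_tokens[stem].add(token)

# e.g. stem_to_tokens['die'] -> {'dies', 'died'}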
Upvotes: 0