Reputation: 1642
I have a text file that I am trying to stem after having removed stopwords, but it seems that nothing changes when I run it. My file is called data0.
Here is my code:
## Removing stopwords and tokenizing by words (split each word)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
data0 = word_tokenize(data0)
data0 = ' '.join([word for word in data0 if word not in (stopwords.words('english'))])
print(data0)
## Stemming the data
from nltk.stem import PorterStemmer
ps = PorterStemmer()
data0 = ps.stem(data0)
print(data0)
And I get the following (wrapped for legibility):
For us around Aberdeen , question `` What oil industry ? ( Evening Express , October 26 ) touch deja vu . That question asked almost since day first drop oil pumped North Sea . In past 30 years seen constant cycle ups downs , booms busts industry . I predict happen next . There period worry uncertainty scrabble find something keep local economy buoyant oil gone . Then upturn see jobs investment oil , everyone breathe sigh relief quest diversify go back burner . That downfall . Major industries prone collapse . Look nation 's defunct shipyards extinct coal steel industries . That 's vital n't panic downturns , start planning sensibly future . Our civic business leaders need constantly looking something secure prosperity - tourism , technology , bio-science emerging industries . We need economically strong rather waiting see happens oil roller coaster hits buffers . N JonesEllon
The first part of the code (removing stopwords and tokenizing) works fine, but the second part (stemming) does not. Any idea why?
Upvotes: 3
Views: 3122
Reputation: 14619
Here's what I've done in the past w/NLTK:
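import nltk
from nltk.stem import PorterStemmer

st = PorterStemmer()

def stem_tokens(tokens):
    # Stem one token at a time; stem() expects a single word
    for item in tokens:
        yield st.stem(item)

def go(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join(stem_tokens(tokens))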
Upvotes: 1
Reputation: 455
From the Stemmer docs http://www.nltk.org/howto/stem.html, it looks like the Stemmer is designed to be called on a single word at a time.
Try running it on each word in
[word for word in data0 if word not in (stopwords.words('english'))]
i.e. before calling join. For example:
stemmed_list = []
# data0 here is the list of tokens, not the joined string
for word in [w for w in data0 if w not in stopwords.words('english')]:
    stemmed_list.append(ps.stem(word))
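For reference, here is a minimal sketch of the whole pipeline with the stemmer applied to each token before the join. To keep it runnable without downloading any NLTK data, it assumes a small hand-written stopword set (`STOPWORDS`, a hypothetical stand-in for `stopwords.words('english')`) and a plain `str.split` instead of `word_tokenize`; only `PorterStemmer` comes from NLTK. The function name is mine as well.

```python
from nltk.stem import PorterStemmer

# Hypothetical stand-in for set(stopwords.words('english')), so the sketch
# runs without the NLTK stopwords corpus being downloaded.
STOPWORDS = {"the", "of", "to", "we", "need", "and", "a", "in"}

def remove_stopwords_and_stem(text, stemmer, stopwords=STOPWORDS):
    """Drop stopwords, then stem each remaining token individually."""
    # Assumption: whitespace split instead of word_tokenize.
    tokens = text.split()
    # Unlike the code in the question, this comparison is case-insensitive.
    kept = [tok for tok in tokens if tok.lower() not in stopwords]
    # stem() is called once per token, before joining back into a string.
    return " ".join(stemmer.stem(tok) for tok in kept)

ps = PorterStemmer()
result = remove_stopwords_and_stem("We need to diversify the industries", ps)
print(result)
```

Building the stopword collection once as a set also avoids calling `stopwords.words('english')` again for every token, which the list comprehension above does.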
Edit: Comment Response. I ran the following - and it seemed to stem correctly:
>>> from nltk.stem import PorterStemmer
>>> ps = PorterStemmer()
>>> data0 = '''<Your Data0 string>'''
>>> words = data0.split(" ")
>>> stemmed_words = map(ps.stem, words)
>>> print(list(stemmed_words)) # list cast needed because of 'map'
[..., 'industri', ..., 'diversifi']
I don't think there is a stemmer that can be applied straight to text, but you can wrap it in your own function that takes 'ps' and the text:
def my_stem(text, stemmer):
    words = text.split(" ")
    # Pass the bound stem method; the stemmer object itself is not a function
    stemmed_words = map(stemmer.stem, words)
    result = " ".join(stemmed_words)
    return result
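One detail worth checking in a wrapper like this: as far as I can tell, a `PorterStemmer` instance is not itself callable, so `map` needs the bound method `stemmer.stem` rather than the stemmer object. A small usage sketch under that assumption (the name `stem_text` is mine, not part of NLTK):

```python
from nltk.stem import PorterStemmer

def stem_text(text, stemmer):
    """Stem every whitespace-separated word of `text` with `stemmer`."""
    # map() receives the bound method stemmer.stem, applied word by word.
    return " ".join(map(stemmer.stem, text.split()))

ps = PorterStemmer()
stemmed = stem_text("industry diversify", ps)
print(stemmed)
```

Any other object with a `stem` method that takes one word could be passed in the same way.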
Upvotes: 3