Reputation: 278
I am currently using NLTK's SnowballStemmer to stem the words in my documents, and this worked fine when I had 68 documents. Now I have 4000 documents and it is far too slow. I read another post where someone suggested using PyStemmer, but it is not offered for Python 3.6. Are there any other packages that would do the trick? Or maybe there's something I can do in the code to speed up the process.
Code:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

eng_stemmer = nltk.stem.SnowballStemmer('english')
...
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Wrap the default analyzer so every token gets stemmed.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [eng_stemmer.stem(w) for w in analyzer(doc)]
Upvotes: 1
Views: 1643
Reputation:
PyStemmer's documentation does not say that it works with Python 3.6, but it actually does. Install the Visual Studio C++ Build Tools compatible with Python 3.6, which you can find here: http://landinghub.visualstudio.com/visual-cpp-build-tools
Then try pip install pystemmer
If that doesn't work, make sure you install it manually, exactly as described here: https://github.com/snowballstem/pystemmer
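Once it is installed, PyStemmer can be dropped into the vectorizer from the question. Here is a minimal sketch (the class name PyStemmedCountVectorizer is just illustrative); it uses PyStemmer's Stemmer module, whose stemWords call stems a whole list of tokens in one pass through the C extension rather than making one Python call per word:

import Stemmer
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = Stemmer.Stemmer('english')

class PyStemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(PyStemmedCountVectorizer, self).build_analyzer()
        # Stem the whole token list in a single call into the C extension.
        return lambda doc: english_stemmer.stemWords(analyzer(doc))

PyStemmer's Stemmer objects also keep an internal cache of recently stemmed words, which should help further since the same vocabulary repeats across 4000 documents.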
Upvotes: 1