Reputation: 6382
I'm building a text classification system in Java, using the bag-of-words model for features. One problem with this model is that the number of features is very high, which makes it impossible to fit the data in memory.
However, I came across this tutorial from scikit-learn, which uses specific data structures to solve the issue.
My questions:
1 - How do people generally solve this issue in Java?
2 - Is there a solution similar to the one given in scikit-learn?
Edit: the only solution I've found so far is to write my own sparse vector implementation using hash tables.
Upvotes: 1
Views: 234
Reputation: 708
HashSet and HashMap are the usual way to store bag-of-words vectors in Java. They are naturally sparse representations that grow with the size of the document rather than the size of the dictionary, and the document is usually much smaller.
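To make the HashMap approach concrete, here is a minimal sketch. The class name `SparseBagOfWords`, its method names, and the simple tokenizer (lowercase, split on non-word characters) are assumptions made for illustration, not part of any library. Only the words that actually occur in a document are stored.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: a bag-of-words vector stored as word -> count.
final class SparseBagOfWords {

    // Counts each token; words that do not occur take up no memory.
    static Map<String, Integer> termCounts(String document) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenizer: lowercase, then split on runs of non-word characters.
        for (String token : document.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Dot product of two sparse vectors; iterates over the smaller map only.
    static long dot(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        long sum = 0;
        for (Map.Entry<String, Integer> entry : small.entrySet()) {
            Integer other = large.get(entry.getKey());
            if (other != null) {
                sum += (long) entry.getValue() * other;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = termCounts("The cat sat on the mat");
        System.out.println(doc.get("the"));                   // count of "the"
        System.out.println(dot(doc, termCounts("the dog")));  // similarity score
    }
}
```

Operations such as the dot product cost time proportional to the number of distinct words in the smaller document, not to the size of the vocabulary.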
If you deal with unusual scenarios, such as very large documents or representations, you can look at the sparse bitset implementations that are available. They may be slightly more economical in terms of memory and are used, for example, in massive text classification implementations based on Hadoop.
Most NLP frameworks make this decision for you anyway: you need to supply data in the format the framework expects.
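As a rough illustration of why a bitset-style representation can use less memory, the sketch below stores a binary (present/absent) bag of words as a sorted array of integer feature ids. This is a stand-in for the idea, not a specific sparse bitset library. The class name `SparseBitVector` is hypothetical, and it assumes words have already been mapped to integer ids by some vocabulary.

```java
import java.util.Arrays;

// Hypothetical stand-in for a sparse bitset: the ids of the set bits, sorted.
final class SparseBitVector {
    private final int[] ids; // sorted and distinct; 4 bytes per present feature

    SparseBitVector(int... featureIds) {
        // Assumes feature ids come from an external word -> id vocabulary.
        this.ids = Arrays.stream(featureIds).distinct().sorted().toArray();
    }

    // True if the feature is present in the document (binary search).
    boolean contains(int featureId) {
        return Arrays.binarySearch(ids, featureId) >= 0;
    }

    // Number of distinct features present.
    int size() {
        return ids.length;
    }

    // Number of shared features, computed by merging the two sorted arrays.
    int intersectionSize(SparseBitVector other) {
        int i = 0, j = 0, common = 0;
        while (i < ids.length && j < other.ids.length) {
            if (ids[i] < other.ids[j]) {
                i++;
            } else if (ids[i] > other.ids[j]) {
                j++;
            } else {
                common++;
                i++;
                j++;
            }
        }
        return common;
    }
}
```

A primitive `int[]` avoids the per-entry object overhead of a `HashMap`, at the cost of dropping word counts and making the vector awkward to modify after construction.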
Upvotes: 1
Reputation: 1697
If you want to build this system in Java, I suggest using Weka, a machine learning toolkit similar to sklearn. Here is a simple tutorial on text classification with Weka:
https://weka.wikispaces.com/Text+categorization+with+WEKA
You can download Weka from:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
Upvotes: 1