Reputation: 968
Hi there! I need help implementing the Naive Bayes text classification algorithm in Java so that I can test my data set for research purposes. The algorithm must be implemented in Java rather than run through tools such as Weka or RapidMiner.
My data set has the following structure:
Doc Words Category
That is, the training words and the category of each training string are known in advance. Part of the data set is shown below:
Doc Words Category
Training
1 Integration Communities Process Oriented Structures...(more string) A
2 Integration Communities Process Oriented Structures...(more string) A
3 Theory Upper Bound Routing Estimate global routing...(more string) B
4 Hardware Design Functional Programming Perfect Match...(more string) C
.
.
.
Test
5 Methodology Toolkit Integrate Technological Organisational
6 This test contain string naive bayes test text text test
The data set comes from a MySQL database, and it may contain multiple training strings and multiple test strings. I just need to implement the Naive Bayes text classification algorithm in Java.
The algorithm should follow the following example mentioned here Table 13.1
Source: Read here
I can implement the algorithm in Java myself, but I would like to know whether there is a Java library, with source code and documentation available, that would let me simply test the results.
I only need the results once; this is a one-off test.
So, to come to the point: can somebody recommend a good Java library that helps me code this algorithm and process my data set, or suggest an easy way to do it?
I will be thankful for your help. Thanks in advance.
Upvotes: 5
Views: 13620
Reputation: 1
You can take a look at Blayze - It's a pretty minimal Naive Bayes library for the JVM written in Kotlin. Should be easy to follow.
Full disclosure: I'm one of the authors of Blayze
Upvotes: 0
Reputation: 41
If you want to implement the Naive Bayes text classification algorithm in Java, the WEKA Java API is a good option. The data set has to be in .arff format, and creating an .arff file from a MySQL database is straightforward. Below are the Java code for the classifier and a link to a sample .arff file.
Create a new text document and open it in Notepad. Copy and paste all the text from the link below, then save the file as DataSet.arff. http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.arff
Download Weka Java API: http://www.java2s.com/Code/Jar/w/weka.htm
Code for the classifier:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// The class name is only a wrapper so the snippet compiles; pick any name you like.
public class NaiveBayesDemo {
    public static void main(String[] args) {
        try {
            StringBuilder txtAreaShow = new StringBuilder();
            // Read the ARFF file
            BufferedReader breader = new BufferedReader(new FileReader("DataSet.arff"));
            Instances train = new Instances(breader);
            // The last attribute is the class (yes/no): with 40 attributes, the class index is 39
            train.setClassIndex(train.numAttributes() - 1);
            breader.close();
            // Build the model on the full training set, then run 10-fold cross-validation
            NaiveBayes nB = new NaiveBayes();
            nB.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(nB, train, 10, new Random(1));
            System.out.println("Run Information\n=====================");
            // Report the classifier's class name, not the name of the Instances class
            System.out.println("Scheme: " + nB.getClass().getName());
            System.out.println("Relation: " + train.relationName());
            System.out.println("\nClassifier Model(full training set)\n===============================");
            System.out.println(nB);
            System.out.println(eval.toSummaryString("\nSummary Results\n==================", true));
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toMatrixString());
            // The same report collected in one string (e.g. for a text area)
            txtAreaShow.append("\n\n\n");
            txtAreaShow.append("Run Information\n===================\n");
            txtAreaShow.append("Scheme: " + nB.getClass().getName());
            txtAreaShow.append("\n\nClassifier Model(full training set)"
                    + "\n======================================\n");
            txtAreaShow.append("" + nB);
            txtAreaShow.append(eval.toSummaryString("\n\nSummary Results\n==================\n", true));
            txtAreaShow.append(eval.toClassDetailsString());
            txtAreaShow.append(eval.toMatrixString());
            txtAreaShow.append("\n\n\n");
            System.out.println(txtAreaShow.toString());
        } catch (FileNotFoundException ex) {
            System.err.println("File not found");
            System.exit(1);
        } catch (IOException ex) {
            System.err.println("Invalid input or output.");
            System.exit(1);
        } catch (Exception ex) {
            System.err.println("Exception occurred: " + ex.getMessage());
            System.exit(1);
        }
    }
}
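The step of creating an .arff file from database rows, mentioned at the top of this answer, could look roughly like the sketch below. It is not part of WEKA: the class ArffWriterSketch, the Row type and the attribute names "words" and "category" are hypothetical, and the rows are hard-coded where a real program would read them from MySQL. It assumes category labels are simple tokens such as A, B, C. Note that a string attribute like this usually has to be converted into word features first (WEKA's StringToWordVector filter is meant for that) before NaiveBayes can use it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: turns (words, category) rows into ARFF text. All names are hypothetical. */
final class ArffWriterSketch {

    /** One row of the data set: the document's words and its category. */
    static final class Row {
        final String words;
        final String category;

        Row(String words, String category) {
            this.words = words;
            this.category = category;
        }
    }

    /** Wraps a value in single quotes, escaping backslashes and single quotes. */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Builds the ARFF text: a string attribute for the words, a nominal one for the class. */
    static String toArff(String relation, List<Row> rows) {
        // Collect the distinct categories in first-seen order (assumed to be simple tokens).
        Set<String> categories = new LinkedHashSet<>();
        for (Row row : rows) {
            categories.add(row.category);
        }
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute words string\n");
        sb.append("@attribute category {").append(String.join(",", categories)).append("}\n\n");
        sb.append("@data\n");
        for (Row row : rows) {
            sb.append(quote(row.words)).append(',').append(row.category).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hard-coded rows stand in for the result of a database query.
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("Integration Communities Process Oriented Structures", "A"));
        rows.add(new Row("Theory Upper Bound Routing Estimate global routing", "B"));
        System.out.print(toArff("documents", rows));
    }
}
```

Saving the returned string as DataSet.arff gives a file that the classifier code above can read once the text has been converted to word attributes.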
Upvotes: 1
Reputation: 65
You can use the Weka Java API and include it in your project if you do not want to use the GUI.
Here's a link to the documentation to incorporate a classifier in your code: https://weka.wikispaces.com/Use+WEKA+in+your+Java+code
Upvotes: 2
Reputation: 1036
As per your requirement, you can use MLlib, the machine learning library from Apache Spark. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities. There is also a Java code template for implementing the algorithm with the library. To begin with, you can:
Implement the Java skeleton for Naive Bayes provided on their site, as given below.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaPairRDD<Double, Double> predictionAndLabel =
test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
@Override public Boolean call(Tuple2<Double, Double> pl) {
return pl._1().equals(pl._2());
}
}).count() / (double) test.count();
For loading and testing your data sets, I know of no better option than Spark SQL, and MLlib fits into Spark's APIs well. To start, I recommend going through the MLlib API first and implementing the algorithm according to your needs, which is fairly easy with the library. Then, to process your data sets, use Spark SQL. I recommend sticking with this combination: I tried several options before settling on this easy-to-use library, with its seamless interoperation with other technologies. I would have posted complete code tailored to your question, but I think you are good to go.
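The training and test placeholders in the skeleton above need numeric feature vectors, so the word strings have to be turned into counts first. Here is a minimal plain-Java sketch of that step, with no Spark calls: the class BagOfWordsSketch and its methods are hypothetical names, and whitespace tokenization with lower-casing is an assumption about how the words should be split.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Sketch only: maps word strings to term-count vectors. All names are hypothetical. */
final class BagOfWordsSketch {
    // Word -> column index, in the order the words were first seen during fit().
    private final Map<String, Integer> index = new LinkedHashMap<>();

    /** Splits on whitespace after lower-casing (an assumed tokenization). */
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Builds the vocabulary from the training documents. */
    void fit(List<String> trainingDocs) {
        for (String doc : trainingDocs) {
            for (String token : tokenize(doc)) {
                index.putIfAbsent(token, index.size());
            }
        }
    }

    /** Counts how often each vocabulary word occurs; words outside the vocabulary are ignored. */
    double[] transform(String doc) {
        double[] counts = new double[index.size()];
        for (String token : tokenize(doc)) {
            Integer column = index.get(token);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    int vocabularySize() {
        return index.size();
    }

    public static void main(String[] args) {
        BagOfWordsSketch bow = new BagOfWordsSketch();
        bow.fit(Arrays.asList(
                "Integration Communities Process Oriented Structures",
                "Theory Upper Bound Routing Estimate global routing"));
        System.out.println(Arrays.toString(bow.transform("global routing routing")));
    }
}
```

Each resulting array, together with a numeric label for the category, could then be wrapped in a LabeledPoint (for example via Vectors.dense) to fill in the training and test sets.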
Upvotes: 3
Reputation: 4918
You can use an algorithm platform like KNIME; it has a variety of classification algorithms (Naive Bayes included). You can run it with a GUI or through its Java API.
Upvotes: 1
Reputation: 4925
Please use scikit-learn from Python. There is already an implementation of what you need:
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
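Since the question asks for Java, here is a rough plain-Java sketch of what a multinomial Naive Bayes classifier with additive smoothing (alpha) and priors learned from the data computes. It is not a port of the scikit-learn class: the name MultinomialNaiveBayesSketch is hypothetical, tokenization is assumed to be lower-cased whitespace splitting, and test words that never occurred in training are simply skipped. The sample rows are only the visible words from the question's truncated data, and the query strings are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch only: multinomial Naive Bayes with additive smoothing. All names are hypothetical. */
final class MultinomialNaiveBayesSketch {
    private final double alpha;                                   // smoothing constant
    private final Map<String, Integer> docsPerClass = new TreeMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    MultinomialNaiveBayesSketch(double alpha) {
        this.alpha = alpha;
    }

    private static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Adds one labelled training document to the counts. */
    void train(String words, String category) {
        totalDocs++;
        docsPerClass.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String token : tokenize(words)) {
            counts.merge(token, 1, Integer::sum);
            tokensPerClass.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** log P(c) + sum over known words of log P(word | c), for a category seen in training. */
    double logScore(String words, String category) {
        double score = Math.log(docsPerClass.get(category) / (double) totalDocs);
        Map<String, Integer> counts = wordCounts.get(category);
        double denominator = tokensPerClass.getOrDefault(category, 0) + alpha * vocabulary.size();
        for (String token : tokenize(words)) {
            if (!vocabulary.contains(token)) {
                continue; // words never seen in training carry no information here
            }
            score += Math.log((counts.getOrDefault(token, 0) + alpha) / denominator);
        }
        return score;
    }

    /** Returns the category with the highest log score, or null if nothing was trained. */
    String predict(String words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerClass.keySet()) {
            double score = logScore(words, category);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MultinomialNaiveBayesSketch nb = new MultinomialNaiveBayesSketch(1.0);
        nb.train("Integration Communities Process Oriented Structures", "A");
        nb.train("Integration Communities Process Oriented Structures", "A");
        nb.train("Theory Upper Bound Routing Estimate global routing", "B");
        nb.train("Hardware Design Functional Programming Perfect Match", "C");
        System.out.println(nb.predict("Global Routing Estimate")); // prints B
    }
}
```

Working in log space avoids underflow when many small probabilities are multiplied. A document made up only of unseen words falls back to the class priors, so with the rows above it would be assigned to A.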
Upvotes: 1
Reputation: 450
You may want to take a look at this.
https://mahout.apache.org/users/classification/bayesian.html
Upvotes: 1
Reputation: 404
Hi, I think Spark would help you a lot: http://spark.apache.org/docs/1.2.0/mllib-naive-bayes.html. You can even choose whichever language suits your needs best: Java, Python or Scala.
Upvotes: 1
Reputation: 12339
Please take a look at the Bow toolkit.
It has a GNU license and comes with source code. Its features include:
Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
Performing test/train splits, and automatic classification tests.
It's not a Java library, but you could compile the C code and use it to check that your Java code produces similar results for a given corpus.
I also spotted a decent Dr. Dobb's article that implements it in Perl. Once again, it is not the desired Java, but it will give you the one-time results that you are asking for.
Upvotes: 1