Reputation: 823
I'm trying to run a Spark application on a standalone cluster. In this application I'm training a Naive Bayes classifier using TF-IDF vectors.
I wrote the application in a similar manner to this post (Spark MLLib TFIDF implementation for LogisticRegression). The main difference is that I take each document, tokenize it and normalize it:
// Read each file as a (path, content) pair and parse it into a Document.
JavaRDD<Document> termDocsRdd = sc.wholeTextFiles("D:/fileFolder")
        .flatMap(new FlatMapFunction<Tuple2<String, String>, Document>() {
            @Override
            public Iterable<Document> call(Tuple2<String, String> tup) {
                // parsingFunction tokenizes and normalizes one file's content.
                return Arrays.asList(parsingFunction(tup));
            }
        });
parsingFunction doesn't call any Spark transformations such as map or flatMap, so it doesn't perform any distributed operations itself.
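Roughly, parsingFunction looks like this (a simplified sketch; the real tokenization/normalization logic and the Document constructor are more involved than shown here):

import java.util.ArrayList;
import java.util.List;
import scala.Tuple2;

// Simplified sketch: plain Java work on a single (path, content) pair,
// no RDD transformations inside.
private static Document parsingFunction(Tuple2<String, String> tup) {
    String path = tup._1();      // file path from wholeTextFiles
    String content = tup._2();   // whole file content

    // Tokenize and normalize (placeholder logic).
    List<String> tokens = new ArrayList<String>();
    for (String token : content.toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
            tokens.add(token);
        }
    }
    return new Document(path, tokens); // Document is my own class
}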
My cluster consists of one master machine and two worker nodes. All machines have an 8-core CPU and 16 GB of RAM. I'm trying to train the classifier on 20 text files (each ~100 KB - 1.5 MB). I don't use a distributed filesystem; the files are placed directly on the nodes.
The problem is that my cluster doesn't run as fast as I expected: training the classifier took about 5 minutes... In local mode the same operation took much less time.
What should I pay attention to?
I would appreciate any advice.
Thank you!
Upvotes: 0
Views: 373
Reputation: 1038
Did you cache the RDD for the training data? An iterative algorithm like training a Bayes classifier will perform poorly unless the RDD is cached.
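For example, if your TF-IDF vectors end up in a JavaRDD<LabeledPoint> (the variable name trainingData below is just an illustration, not taken from your code), cache it once before handing it to NaiveBayes.train:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;

// trainingData: JavaRDD<LabeledPoint> produced by your TF-IDF pipeline
// (hypothetical name). Caching it keeps the text files from being re-read
// and re-parsed on every pass over the data.
trainingData.cache();

// Train the model; 1.0 is the default additive-smoothing parameter.
NaiveBayesModel model = NaiveBayes.train(trainingData.rdd(), 1.0);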
Upvotes: 1