Reputation: 823
I'm trying to run a Spark application on a standalone cluster. In this application I'm training a Naive Bayes classifier using TF-IDF vectors.
I wrote the application in a similar manner to this post (Spark MLLib TFIDF implementation for LogisticRegression). The main difference is that I take each document, tokenize it and normalize it:
// Read each file as a (path, content) pair and parse it into a Document.
JavaRDD<Document> termDocsRdd = sc.wholeTextFiles("D:/fileFolder")
        .flatMap(new FlatMapFunction<Tuple2<String, String>, Document>() {
            @Override
            public Iterable<Document> call(Tuple2<String, String> tup) {
                // parsingFunction tokenizes and normalizes one file's content.
                return Arrays.asList(parsingFunction(tup));
            }
        });
parsingFunction doesn't call any Spark transformations such as map or flatMap, so it doesn't perform any distributed operations itself.
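Roughly, parsingFunction looks like this (a simplified sketch; the real tokenization/normalization logic and the Document constructor are more involved than shown here):

import java.util.ArrayList;
import java.util.List;
import scala.Tuple2;

// Simplified sketch: plain Java work on a single (path, content) pair,
// no RDD transformations inside.
private static Document parsingFunction(Tuple2<String, String> tup) {
    String path = tup._1();      // file path from wholeTextFiles
    String content = tup._2();   // whole file content

    // Tokenize and normalize (placeholder logic).
    List<String> tokens = new ArrayList<String>();
    for (String token : content.toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
            tokens.add(token);
        }
    }
    return new Document(path, tokens); // Document is my own class
}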
My cluster consists of one master machine and two worker nodes. All machines have an 8-core CPU and 16 GB of RAM. I'm trying to train the classifier on 20 text files (each ~100 KB - 1.5 MB). I don't use a distributed filesystem; the files are placed directly on the nodes.
The problem is that my cluster doesn't run as fast as I expected: training the classifier took about 5 minutes... In local mode the same operation took much less time.
What should I pay attention to?
I would appreciate any advice.
Thank you!
Upvotes: 0
Views: 373
Reputation: 1038
Did you cache the RDD for the training data? An iterative algorithm like training a Bayes classifier will perform poorly unless the RDD is cached.
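For example, if your TF-IDF vectors end up in a JavaRDD<LabeledPoint> (the variable name trainingData below is just an illustration, not taken from your code), cache it once before handing it to NaiveBayes.train:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;

// trainingData: JavaRDD<LabeledPoint> produced by your TF-IDF pipeline
// (hypothetical name). Caching it keeps the text files from being re-read
// and re-parsed on every pass over the data.
trainingData.cache();

// Train the model; 1.0 is the default additive-smoothing parameter.
NaiveBayesModel model = NaiveBayes.train(trainingData.rdd(), 1.0);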
Upvotes: 1