dimson

Reputation: 823

Low efficiency of Spark's application on standalone cluster

I'm trying to run a Spark application on a standalone cluster. In this application I'm training a Naive Bayes classifier using TF-IDF vectors.

I wrote the application in a similar manner to this post (Spark MLLib TFIDF implementation for LogisticRegression). The main difference is that I take each document, then tokenize and normalize it:

JavaRDD<Document> termDocsRdd = sc.wholeTextFiles("D:/fileFolder")
    .flatMap(new FlatMapFunction<Tuple2<String, String>, Document>() {
        @Override
        public Iterable<Document> call(Tuple2<String, String> tup) {
            return Arrays.asList(parsingFunction(tup));
        }
    });

parsingFunction doesn't use any Spark operations such as map or flatMap, so it doesn't involve any data distribution itself.
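The actual parsingFunction isn't shown in the question; as a rough illustration only, a purely local tokenize-and-normalize step might look like the sketch below (the class name, the cleanup rules, and returning raw tokens instead of the question's Document type are all assumptions):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of a purely local parsing step: lowercase the
// contents, strip punctuation, and split on whitespace. No Spark calls
// are involved, so nothing here is distributed.
public class ParsingSketch {
    public static List<String> parsingFunction(String fileName, String contents) {
        return Arrays.stream(contents.toLowerCase()
                        .replaceAll("[^a-z0-9\\s]", " ")
                        .split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints: [hello, world, 42]
        System.out.println(parsingFunction("doc1.txt", "Hello, World! 42"));
    }
}
```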

My cluster consists of one master machine and two worker nodes. All machines have an 8-core CPU and 16 GB of RAM. I'm trying to train the classifier on 20 text files (each ~100 KB - 1.5 MB). I don't use a distributed filesystem; instead I put the files directly on the nodes.

The problem is that my cluster doesn't work as fast as I expected - training the classifier took about 5 minutes... In local mode the same operation takes much less time.

What should I pay attention to?

I would appreciate any advice.

Thank You!

Upvotes: 0

Views: 373

Answers (1)

Josh Milthorpe

Reputation: 1038

Did you cache the RDD for the training data? An iterative algorithm like training a Bayes classifier will perform poorly unless the RDD is cached.
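A minimal sketch of what that could look like with the MLlib Java API, assuming a pipeline like the one in the question (the `training` variable, the elided TF-IDF step, and the smoothing value 1.0 are placeholders, not the asker's actual code):

```java
// Sketch only: assume `training` holds the labeled TF-IDF vectors
// built from termDocsRdd, as in the linked post.
JavaRDD<LabeledPoint> training = ...; // result of the TF-IDF pipeline

// Cache before training so repeated passes over the data reuse the
// materialized vectors instead of re-reading and re-parsing the files.
training.cache();

NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
```

Without the `cache()` call, each pass over the RDD re-executes the whole lineage (file reads, parsing, TF-IDF), which can easily dominate the runtime on a small dataset like this.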

Upvotes: 1
