cwiq

Reputation: 147

Time-consuming JavaRDD method take()

How can I deal with the time-consuming take() method provided by JavaRDD?

Instant startInit = Instant.now();
// sc is assumed to be the JavaSparkContext; parallelize() is called on it, not on the list
JavaRDD<Foo> fooJavaRDD = sc.parallelize(listOfFoo).map(new Foo()).sortBy(a -> a.sortRule(), true, NoPartitions);
Instant stopInit = Instant.now();

Instant startTake = Instant.now();
List<Foo> fooList = fooJavaRDD.take(1);
Instant stopTake = Instant.now();

System.out.println("Init: " + Duration.between(startInit, stopInit).toMillis());
System.out.println("Take: " + Duration.between(startTake, stopTake).toMillis());

Output I get (in millis):

Init: 417
Take: 1322

It seems strange that parallelizing, mapping and sorting together take less time than take().

Is there another way to get the best result from map()?

Upvotes: 1

Views: 150

Answers (1)

Mike Pone

Reputation: 19330

The map() doesn't actually run until take() is called in the code. Spark evaluates transformations lazily: it does not execute map() until the results of map() are needed, and that does not happen until an action such as take() is called. Right now the "Take" figure includes the time of map() as well as take(). If you want the timing of take() alone, force the earlier steps to run first, for example by calling cache() on the RDD followed by an action such as count(), and only then time take(). Note that repartition() is itself a lazy transformation, so calling it alone will not trigger the work. It's not exactly intuitive and I have run into this many, many times.
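
To see the same effect without a cluster, here is a minimal sketch that uses plain Java streams as an analogy. It is not Spark code: java.util.stream is also lazy, so it stands in for the RDD here, and the class name LazyTimingDemo, the expensive() helper and the log list are all made up for illustration. The log shows that the mapping function is not called while the pipeline is being built, only once the terminal operation (the analogue of take(1)) runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyTimingDemo {
    // Records which steps have actually executed, in order (illustrative only).
    static final List<String> log = new ArrayList<>();

    // Hypothetical stand-in for the map() function; logs each call so we can see when it runs.
    static int expensive(int x) {
        log.add("map(" + x + ")");
        return x * 10;
    }

    static List<String> run() {
        log.clear();
        // Building the pipeline: nothing is mapped or sorted at this point.
        Stream<Integer> pipeline = Stream.of(3, 1, 2)
                .map(LazyTimingDemo::expensive)
                .sorted();
        log.add("pipeline built");
        // The terminal operation plays the role of take(1) and triggers all the deferred work.
        List<Integer> first = pipeline.limit(1).collect(Collectors.toList());
        log.add("result " + first);
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Running main prints "pipeline built" before any "map(...)" line, which mirrors why the "Init" timing in the question looks cheap and the "Take" timing absorbs the cost of the deferred steps.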

Upvotes: 2
