Reputation: 308
I'm not an expert in Spark, and I'm using Spark to do some calculations.
// [userId, lastPurchaseLevel]
JavaPairRDD<String, Integer> lastPurchaseLevels =
    levels.groupByKey()
          .join(purchases.groupByKey())
          .mapValues(t -> getLastPurchaseLevel(t));
And inside the getLastPurchaseLevel() function, I have code like this:
private static Integer getLastPurchaseLevel(Tuple2<Iterable<SourceLevelRecord>, Iterable<PurchaseRecord>> t) {
    ....
    // Find the purchase with the latest timestamp
    final Comparator<PurchaseRecord> comp = (a, b) -> Long.compare(a.dateMsec, b.dateMsec);
    PurchaseRecord latestPurchase = purchaseList.stream().max(comp).get();
But my boss told me not to use stream(); he said:
We better do the classic way because there are no CPU core remains to do the streaming -- all CPUs are used by Spark workers already.
I know the classic way is to iterate through the list and find the max. So does stream() cause more CPU consumption or overhead than the classic way? Or is that only the case in this kind of Spark context?
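For reference, here is a minimal sketch of the "classic way" next to the stream version, using a hypothetical cut-down PurchaseRecord that has only the dateMsec field the comparison needs:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MaxDemo {
    // Hypothetical stand-in for PurchaseRecord: only the field used in the comparison
    static class PurchaseRecord {
        final long dateMsec;
        PurchaseRecord(long dateMsec) { this.dateMsec = dateMsec; }
    }

    // The "classic way": a plain loop that tracks the running maximum
    static PurchaseRecord latestByLoop(List<PurchaseRecord> purchases) {
        PurchaseRecord latest = null;
        for (PurchaseRecord p : purchases) {
            if (latest == null || p.dateMsec > latest.dateMsec) {
                latest = p;
            }
        }
        return latest;
    }

    // The stream version, as in the question
    static PurchaseRecord latestByStream(List<PurchaseRecord> purchases) {
        Comparator<PurchaseRecord> comp = (a, b) -> Long.compare(a.dateMsec, b.dateMsec);
        return purchases.stream().max(comp).get();
    }

    public static void main(String[] args) {
        List<PurchaseRecord> purchases = Arrays.asList(
                new PurchaseRecord(100L), new PurchaseRecord(300L), new PurchaseRecord(200L));
        System.out.println(latestByLoop(purchases).dateMsec);   // 300
        System.out.println(latestByStream(purchases).dateMsec); // 300
    }
}
```

Both versions do the same single pass over the data; the difference is a small constant overhead for the stream pipeline, not extra threads.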
Upvotes: 1
Views: 790
Reputation: 359
We better do the classic way because there are no CPU core remains to do the streaming -- all CPUs are used by Spark workers already.
Your boss's point: Spark already schedules tasks onto threads (or CPU cores), so there is no need to do things concurrently inside a single task.
... so stream will cause more CPU consumption or overhead than the classic way? Or is it only in these kind of Spark context?
A Java stream is single-threaded unless specified otherwise (by calling the Stream.parallel() method). So as long as you don't parallelize the stream, your boss has nothing to complain about.
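You can check this yourself by recording which threads actually run the pipeline. The sketch below (a standalone demo, not Spark code) collects the thread names seen while processing a stream; a sequential stream stays on the calling thread, while a parallel one may fan out to ForkJoinPool.commonPool workers:

```java
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamThreads {
    // Returns the set of thread names that processed the stream's elements
    static Set<String> threadsUsed(boolean parallel) {
        IntStream s = IntStream.range(0, 1_000);
        if (parallel) {
            s = s.parallel(); // opt in to the common ForkJoinPool
        }
        return s.mapToObj(i -> Thread.currentThread().getName())
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Sequential: exactly one thread, the caller itself
        System.out.println(threadsUsed(false));
        // Parallel: may include ForkJoinPool.commonPool worker threads
        System.out.println(threadsUsed(true));
    }
}
```

Note that even a parallel stream is not guaranteed to use extra threads on a busy or single-core machine; the only guaranteed behavior is that a sequential stream never does.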
Upvotes: 1