Reputation: 5892
I'm running a Spark 2.3.1 standalone cluster. My job consumes mini-batches from Kafka every 2 minutes and writes an aggregation to a store. The job looks like the following:
val stream = KafkaUtils.createDirectStream(...)
stream.map(x => Row(...))
  .flatMap(r => ... List[Row])     // expand each record into several rows
  .map(r => (k, r))                // k: the key derived from the row
  .reduceByKey((r1, r2) => r1)     // keep a single row per key
  .map { case (_, v) => v }
  .foreachRDD { (rdd, time) =>
    // write data
  }
When I look at the DAG, I see the following picture.
My question: as far as I understand, Spark should use a combiner for the reduceByKey operation, which should significantly reduce the shuffle size. Why doesn't the DAG show this, and how can I check it?
Additional question: if the shuffle size is 2.5G, does it hit disk? Which configuration properties/metrics should I look into to check that the job is configured and running optimally? For this job the executors run with 10G of memory.
Upvotes: 0
Views: 44
Reputation: 18003
First question: reduceByKey internally calls combineByKey, so you will not see a difference in the DAG as a result, i.e. the tasks are the same.
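To convince yourself the combiner is there, you can write the reduceByKey step explicitly with combineByKey; map-side combining is enabled by default, so partial aggregation happens before the shuffle in both forms. A minimal sketch, assuming String keys and the (r1, r2) => r1 reduce function from your snippet:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Equivalent to pairs.reduceByKey((r1, r2) => r1); reduceByKey delegates to
// combineByKey internally, with map-side combining enabled by default.
def reduceEquivalent(pairs: RDD[(String, Row)]): RDD[(String, Row)] =
  pairs.combineByKey[Row](
    (r: Row) => r,              // createCombiner: the first row seen for a key
    (acc: Row, r: Row) => acc,  // mergeValue: map-side combine within a partition
    (a: Row, b: Row) => a       // mergeCombiners: merge partial results after the shuffle
  )

A practical way to check the effect is in the stage metrics: if combining is working, the shuffle write record count of the map stage will be much smaller than the number of input records.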
Second question: please make it a new posting. Since you have not, the Details for Stage page in the Spark UI, specifically the "Shuffle Spill (Disk)" column, should give you an indication.
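Besides the UI, you can log the relevant task metrics yourself with a SparkListener; non-zero memoryBytesSpilled/diskBytesSpilled tells you whether the shuffle actually hit disk. A rough sketch (the listener class name is just for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs shuffle write and spill metrics per task; register it once, e.g.
// ssc.sparkContext.addSparkListener(new ShuffleSpillListener)
class ShuffleSpillListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(
        s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"shuffleWriteBytes=${m.shuffleWriteMetrics.bytesWritten} " +
        s"memorySpilled=${m.memoryBytesSpilled} diskSpilled=${m.diskBytesSpilled}")
    }
  }
}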
Upvotes: 1