Reputation: 5892
I'm running a Spark 2.3.1 standalone cluster. My job consumes mini-batches from Kafka every 2 minutes and writes an aggregation to a store. The job looks like the following:
val stream = KafkaUtils.createDirectStream(...)
stream.map(x => Row(...))
  .flatMap(r => ... List[Row])     // expand each record into several rows
  .map(r => (k, r))                // k: the key derived from the row
  .reduceByKey((r1, r2) => r1)     // keep a single row per key
  .map { case (_, v) => v }
  .foreachRDD { (rdd, time) =>
    // write data
  }
When I look at the DAG, I see the following picture.
My question: as far as I understand, Spark should use a combiner for the reduceByKey operation, which should significantly reduce the shuffle size. Why doesn't the DAG show this, and how can I check it?
Additional question: if the shuffle size is 2.5G, does it hit disk? Which configuration properties/metrics should I look into to check that the job is configured and running optimally? For this job the executors run with 10G of memory.
Upvotes: 0
Views: 44
Reputation: 18003
First question: reduceByKey internally calls combineByKey, so you will not see a difference in the DAG as a result, i.e. the tasks are the same.
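To convince yourself the combiner is there, you can write the reduceByKey step explicitly with combineByKey; map-side combining is enabled by default, so partial aggregation happens before the shuffle in both forms. A minimal sketch, assuming String keys and the (r1, r2) => r1 reduce function from your snippet:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Equivalent to pairs.reduceByKey((r1, r2) => r1); reduceByKey delegates to
// combineByKey internally, with map-side combining enabled by default.
def reduceEquivalent(pairs: RDD[(String, Row)]): RDD[(String, Row)] =
  pairs.combineByKey[Row](
    (r: Row) => r,              // createCombiner: the first row seen for a key
    (acc: Row, r: Row) => acc,  // mergeValue: map-side combine within a partition
    (a: Row, b: Row) => a       // mergeCombiners: merge partial results after the shuffle
  )

A practical way to check the effect is in the stage metrics: if combining is working, the shuffle write record count of the map stage will be much smaller than the number of input records.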
Second question: please make it a new posting. Since you have not, the Details for Stage page in the Spark UI, specifically the "Shuffle Spill (Disk)" column, should give you an indication.
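Besides the UI, you can log the relevant task metrics yourself with a SparkListener; non-zero memoryBytesSpilled/diskBytesSpilled tells you whether the shuffle actually hit disk. A rough sketch (the listener class name is just for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs shuffle write and spill metrics per task; register it once, e.g.
// ssc.sparkContext.addSparkListener(new ShuffleSpillListener)
class ShuffleSpillListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(
        s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"shuffleWriteBytes=${m.shuffleWriteMetrics.bytesWritten} " +
        s"memorySpilled=${m.memoryBytesSpilled} diskSpilled=${m.diskBytesSpilled}")
    }
  }
}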
Upvotes: 1