Julias

Reputation: 5892

How to check that reduceByKey is executed efficiently in Spark Streaming

I'm running a Spark 2.3.1 standalone cluster. My job consumes mini-batches from Kafka every 2 minutes and writes aggregations to a store. The job looks like the following:

val stream = KafkaUtils.createDirectStream(...)
stream.map(x => Row(...))
   .flatMap(r => ... /* expands each row into a List[Row] */)
   .map(r => (k, r))                 // key each row by k
   .reduceByKey((r1, r2) => r1)      // keep one row per key
   .map { case (_, v) => v }
   .foreachRDD { (rdd, time) => /* write data */ }
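
For reference, here is a minimal self-contained sketch of the same pipeline shape, with hypothetical broker/topic names, a CSV split standing in for the flatMap, and a string-prefix key standing in for k:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010._

object ReduceByKeyJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("reduce-by-key-demo")
    val ssc  = new StreamingContext(conf, Minutes(2))        // 2-minute mini batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",                 // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "demo-group",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.map(record => record.value)         // raw payload
      .flatMap(_.split(","))                   // stand-in for the List[Row] expansion
      .map(v => (v.take(4), v))                // stand-in key k
      .reduceByKey((r1, r2) => r1)             // keep one value per key
      .map { case (_, v) => v }
      .foreachRDD { (rdd, time) =>
        println(s"$time: ${rdd.count()} rows") // stand-in for the store write
      }

    ssc.start()
    ssc.awaitTermination()
  }
}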

When I look at the DAG, the picture is as follows:

[DAG screenshot]

My question: as far as I understand, Spark should use a combiner for the reduceByKey operation, which should significantly reduce the shuffle size. Why doesn't the DAG show this, and how can I check that it happens?

An additional question: if the shuffle size is 2.5 GB, does it hit disk? Which configuration properties and metrics should I look at to check that the job is configured and running optimally? For this job the executors run with 10 GB of memory.

Upvotes: 0

Views: 44

Answers (1)

Ged

Reputation: 18003

First question: reduceByKey internally calls combineByKey with map-side combining enabled, so the combine step is folded into the same tasks. You will not see a difference in the DAG as a result, i.e. the tasks are the same.
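
To make that concrete, here is a minimal sketch (local mode, toy Int values rather than Row) showing that reduceByKey is just combineByKey with the reducing function used for both the map-side and reduce-side merges; the mergeValue step is the combiner that runs before the shuffle:

import org.apache.spark.sql.SparkSession

object CombineByKeyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("combine-demo").master("local[*]").getOrCreate()
    val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    val viaReduce = pairs.reduceByKey(_ + _)

    // reduceByKey(f) delegates to combineByKey with f used everywhere:
    val viaCombine = pairs.combineByKey[Int](
      (v: Int) => v,                 // createCombiner
      (acc: Int, v: Int) => acc + v, // mergeValue: the map-side combiner
      (a: Int, b: Int) => a + b      // mergeCombiners: reduce-side merge
    )

    assert(viaReduce.collect().toMap == viaCombine.collect().toMap)
    spark.stop()
  }
}

A practical check is the Shuffle Write size on the Details for Stage page: with map-side combining, each task writes at most one record per distinct key, so the shuffle write should be much smaller than the stage input when keys repeat.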

2nd question, please make a new posting. Since you have not: on the Details for Stage page, the Shuffle Spill (Disk) metric should give you an indication.
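
If the UI is inconvenient, the same numbers are exposed by Spark's monitoring REST API. A rough sketch, with a hypothetical driver host and application id:

import scala.io.Source

object SpillCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical driver host/port and application id -- substitute your own.
    val url = "http://driver-host:4040/api/v1/applications/app-20180901120000-0001/stages"

    // Each stage entry includes shuffleWriteBytes, memoryBytesSpilled and
    // diskBytesSpilled; non-zero diskBytesSpilled means the shuffle hit disk.
    println(Source.fromURL(url).mkString)
  }
}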

Upvotes: 1
