Spark Streaming: Issues when processing time batch time

Question

I am running a spark streaming (1.6.1) on yarn using DirectAPI to read events from Kafka topic having 50 partitions and writing on HDFS. I have a batch interval of 60 seconds. I was receiving around 500K messages which was getting processed under 60 Sec.

Suddenly spark started receiving 15-20 million messages which took around 5-6 minutes to process with a batch interval of 60 seconds. I have configured "spark.streaming.concurrentJobs=4".

So when batch takes a long time for processing spark initiate concurrent 4 active tasks to handle the backlog batches but still over a period of time batch backlog increases as batch interval is too less for such volume of data.

I have few doubts around this.

When I start receiving 15-20 million messages & time to process those messages is around 5-6 minutes with batch interval of 60 Sec. When I check my HDFS directory I see the files created for each 60 Sec with 50 part files, I am little confused here my batch is getting processed in 5-6 minutes, then how it is writing files on HDFS every 1 min & 'saveAsTextFile' action is called only once per batch. Total records from all the files 50 part files comes around 3.3 million.
In order to handle the processing of 15-20 million messages, I configured my batch interval to 8-10 minutes now spark started consuming around 35-40 million messages from Kafka & again its processing time started exceeding batch interval.

I have configured 'spark.streaming.kafka.maxRatePerPartition=50' & 'spark.streaming.backpressure.enabled=true'.

Spark Streaming: Issues when processing time > batch time

Answers (1)

Related Questions

Spark Streaming: Issues when processing time &gt; batch time

Answers (1)

Related Questions

Spark Streaming: Issues when processing time > batch time