nilesh1212

Reputation: 1655

Spark Streaming: Issues when processing time > batch time

I am running a Spark Streaming (1.6.1) job on YARN, using the direct API to read events from a Kafka topic with 50 partitions and write them to HDFS. The batch interval is 60 seconds. I was receiving around 500K messages, which were processed in under 60 seconds.

Suddenly Spark started receiving 15-20 million messages, which took around 5-6 minutes to process while the batch interval was still 60 seconds. I have configured "spark.streaming.concurrentJobs=4".

So when a batch takes a long time to process, Spark runs up to 4 jobs concurrently to work through the backlogged batches. Even so, the backlog keeps growing over time, because the batch interval is too short for this volume of data.

I have a few questions about this.

  1. When I receive 15-20 million messages, processing takes around 5-6 minutes with a batch interval of 60 seconds. Yet when I check my HDFS directory, I see a new set of 50 part files created every 60 seconds. This confuses me: if a batch takes 5-6 minutes to process, and the 'saveAsTextFile' action is called only once per batch, how are files being written to HDFS every minute? The total record count across the 50 part files comes to around 3.3 million.

  2. To handle 15-20 million messages, I increased the batch interval to 8-10 minutes. Spark then started consuming around 35-40 million messages from Kafka per batch, and the processing time again exceeded the batch interval.

I have configured 'spark.streaming.kafka.maxRatePerPartition=50' and 'spark.streaming.backpressure.enabled=true'.
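As a rough check on what that rate cap implies, here is a small sketch of the arithmetic. It relies on 'spark.streaming.kafka.maxRatePerPartition' being a limit in records per second per Kafka partition for the direct API, so the ceiling per batch is rate × partitions × batch interval. The function name `max_records_per_batch` is made up for illustration, and the 600-second interval is just one value from the 8-10 minute range mentioned above.

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_interval_s):
    """Upper bound on records one batch can contain when the per-partition cap applies."""
    return max_rate_per_partition * num_partitions * batch_interval_s


# Values from the question: cap of 50 records/s/partition, 50 partitions.
cap_60s = max_records_per_batch(50, 50, 60)    # 60-second batch interval
cap_600s = max_records_per_batch(50, 50, 600)  # assumed 10-minute batch interval

print(cap_60s)   # 150000
print(cap_600s)  # 1500000
```

Both ceilings are far below the 3.3 million and 35-40 million records observed per batch, so it may be worth verifying that the setting is actually being picked up by the running job.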

Upvotes: 6

Views: 2704

Answers (1)

Dennis Jaheruddin

Reputation: 21563

I think what may have confused you is the relationship between how long a job takes and how often a job is started.

From what you describe, with the resources available each job ended up taking about 5 minutes to complete. However, your batch interval is 1 minute.

So every minute you kick off a batch that takes 5 minutes to complete, and because you set "spark.streaming.concurrentJobs=4", several of these batches can be running at the same time.

In the end you should expect HDFS to receive nothing for the first few minutes, and after that to receive output roughly every minute, each batch arriving with a delay of about 5 minutes from when its data went in.
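To make the timing concrete, here is a minimal simulation sketch. It is an idealized model, not Spark's actual scheduler: it assumes every batch takes exactly 300 seconds no matter how many jobs run concurrently, that batches are handed to free job slots in FIFO order, and that batch k is submitted at the end of interval k. The function name `simulate` and its parameters are invented for illustration.

```python
import heapq


def simulate(batch_interval_s, processing_time_s, concurrent_jobs, num_batches):
    """Return a list of (submit_time, start_time, finish_time) tuples, one per batch."""
    # Each heap entry is the time at which one job slot becomes free.
    slots = [0] * concurrent_jobs
    heapq.heapify(slots)
    timeline = []
    for k in range(1, num_batches + 1):
        submit = k * batch_interval_s       # batch k is generated at the end of interval k
        free_at = heapq.heappop(slots)      # earliest slot to become available
        start = max(submit, free_at)        # wait if all slots are still busy
        finish = start + processing_time_s
        heapq.heappush(slots, finish)
        timeline.append((submit, start, finish))
    return timeline


# Numbers from the question: 60 s interval, ~5 min per batch, 4 concurrent jobs.
timeline = simulate(60, 300, 4, 10)
for submit, start, finish in timeline:
    print(f"submitted {submit}s, started {start}s, finished {finish}s, "
          f"delay {finish - submit}s")
```

Under these assumptions the first output lands at 360 s, and the next three follow one minute apart (420 s, 480 s, 540 s), which matches files showing up on HDFS every minute. The fifth batch has to wait for a free slot and finishes at 660 s, so the delay grows from 300 s to 360 s and then to 420 s by the tenth batch: four slots finishing one batch each per 5 minutes cannot keep up with five new batches per 5 minutes. The question's numbers also seem consistent with this picture, since 5-6 one-minute batches of about 3.3 million records each add up to roughly 16.5-20 million.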

Upvotes: 0
