Reputation: 183
Context: I have a Spark Structured Streaming job with Kafka as the source and S3 as the sink. The output files in S3 are then picked up as input by other MapReduce jobs. I therefore want to increase the size of the output files on S3 so that the MapReduce jobs run efficiently. Currently, because the input files are small, the MapReduce jobs are taking far too long to complete.
Is there a way to configure the streaming job to wait until at least 'X' records are available before processing?
Upvotes: 0
Views: 612
Reputation: 18108
Not really, no.
No for Spark prior to 3.x.
Yes and no for Spark 3.x, which effectively comes down to no: the minOffsetsPerTrigger option was introduced, but it has a catch (see below), so the overall answer is still no.
From the manuals:
Minimum number of offsets to be processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. Note, if the maxTriggerDelay is exceeded, a trigger will be fired even if the number of available offsets doesn't reach minOffsetsPerTrigger.
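As a rough sketch of how those two options fit together on the Kafka source (assuming Spark 3.x with the spark-sql-kafka-0-10 connector; the broker address, topic name, S3 paths, and threshold values below are placeholders):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("kafka-to-s3")
  .getOrCreate()

// Kafka source: ask for at least ~100k offsets per micro-batch,
// but never hold a batch back longer than maxTriggerDelay.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "my-topic")                   // placeholder
  .option("minOffsetsPerTrigger", "100000")          // wait for at least this many offsets ...
  .option("maxTriggerDelay", "15m")                  // ... unless this much time has passed
  .load()

// Sink to S3 as Parquet.
val query = df.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/")                 // placeholder
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()
```
The catch in the quoted documentation applies here: once maxTriggerDelay elapses, a micro-batch fires regardless of how few offsets are available, so small output files can still be produced during low-volume periods.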
Upvotes: 0
Reputation: 166
You probably want the micro-batch trigger to wait until sufficient data is available at the source. You can use the minOffsetsPerTrigger option to wait for enough data to accumulate in Kafka. Make sure to also set a maxTriggerDelay that suits your application's needs.
Upvotes: 0