Reputation: 1590
I am processing a large number of files, and I want to handle them chunk by chunk: in each batch, I would like to process 50 files separately.
How can I do this using Spark Structured Streaming?
I have seen that Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) said in a similar question (Spark to process rdd chunk by chunk from json files and post to Kafka topic) that it is possible using Spark Structured Streaming, but I can't find any examples of it.
Thanks a lot,
Upvotes: 3
Views: 6875
Reputation: 2472
If you are using the file source, set the maxFilesPerTrigger option: the maximum number of new files to be considered in every trigger (default: no max).
spark
.readStream
.format("json")
.option("maxFilesPerTrigger", 50)
.load("/path/to/files")
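If you also want to act on each chunk of (up to) 50 files separately, one option is to hand every micro-batch to foreachBatch (available since Spark 2.4). Below is a minimal sketch under a few assumptions: the schema jsonSchema and the per-batch logic inside foreachBatch are placeholders for your own; note that streaming file sources require an explicit schema unless schema inference is enabled.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("chunked-files").getOrCreate()

// Streaming file sources need an explicit schema
// (unless spark.sql.streaming.schemaInference is enabled).
val jsonSchema: StructType = new StructType()
  .add("id", "string")
  .add("value", "double")

val stream = spark
  .readStream
  .format("json")
  .schema(jsonSchema)
  .option("maxFilesPerTrigger", 50)   // at most 50 new files per micro-batch
  .load("/path/to/files")

// Each micro-batch (i.e. each chunk of up to 50 files) arrives as a plain DataFrame.
val query = stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Placeholder per-chunk processing; replace with your own logic.
    batchDF.persist()
    println(s"Processing batch $batchId with ${batchDF.count()} rows")
    batchDF.unpersist()
  }
  .start()

query.awaitTermination()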
If you are using a Kafka source, it is similar, but with the maxOffsetsPerTrigger option.
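For completeness, a rate-limited Kafka read might look like the sketch below; the broker address and topic name are placeholders to replace with your own:

spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:9092")   // placeholder broker
.option("subscribe", "topic1")                      // placeholder topic
.option("maxOffsetsPerTrigger", 50)                 // cap offsets processed per trigger
.load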
Upvotes: 6