Reputation: 1590
I am processing a large number of files, and I want to handle them chunk by chunk: in each batch, I would like to process 50 files separately.
How can I do this using Spark Structured Streaming?
I have seen that Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) said in a similar question (Spark to process rdd chunk by chunk from json files and post to Kafka topic) that it is possible using Spark Structured Streaming, but I can't find any examples of it.
Thanks a lot,
Upvotes: 3
Views: 6875
Reputation: 2472
If you are using the file source, set the maxFilesPerTrigger option: the maximum number of new files to be considered in every trigger (default: no max).
spark
.readStream
.format("json")
.option("maxFilesPerTrigger", 50)
.load("/path/to/files")
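If you also want to act on each chunk of (up to) 50 files separately, one option is to hand every micro-batch to foreachBatch (available since Spark 2.4). Below is a minimal sketch under a few assumptions: the schema jsonSchema and the per-batch logic inside foreachBatch are placeholders for your own; note that streaming file sources require an explicit schema unless schema inference is enabled.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("chunked-files").getOrCreate()

// Streaming file sources need an explicit schema
// (unless spark.sql.streaming.schemaInference is enabled).
val jsonSchema: StructType = new StructType()
  .add("id", "string")
  .add("value", "double")

val stream = spark
  .readStream
  .format("json")
  .schema(jsonSchema)
  .option("maxFilesPerTrigger", 50)   // at most 50 new files per micro-batch
  .load("/path/to/files")

// Each micro-batch (i.e. each chunk of up to 50 files) arrives as a plain DataFrame.
val query = stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Placeholder per-chunk processing; replace with your own logic.
    batchDF.persist()
    println(s"Processing batch $batchId with ${batchDF.count()} rows")
    batchDF.unpersist()
  }
  .start()

query.awaitTermination()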
If you are using a Kafka source, it is similar, but with the maxOffsetsPerTrigger option.
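For completeness, a rate-limited Kafka read might look like the sketch below; the broker address and topic name are placeholders to replace with your own:

spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:9092")   // placeholder broker
.option("subscribe", "topic1")                      // placeholder topic
.option("maxOffsetsPerTrigger", 50)                 // cap offsets processed per trigger
.load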
Upvotes: 6