Reputation: 123
I have two large JSON files that we stream over Kafka: one file is pushed into partition 0 of Topic1 and the other into partition 1 of Topic1. We use a Spark streaming query with a watermark to join these files and perform the necessary calculations. Even though we are only joining the files and performing simple calculations, the Spark UI shows that more than 200 tasks are executed by the Spark engine, taking more than 6 minutes. These stats are from a box with 2 cores and 8 GB of RAM.
Below are a few questions we have: 1) Why are there so many tasks for these simple operations? 2) Does a large JSON file get split among multiple executors? As per my understanding, it's not possible to perform operations on a split part of a JSON document; it has to be processed on one executor. Does that mean we cannot split large XML or JSON among multiple executors to increase parallelism?
Thanks
Upvotes: 2
Views: 775
Reputation: 16086
It's all about partitions:
200 is the default value of the Spark shuffle partitions parameter, which defines the number of partitions created after a shuffle. In your case, the join is causing a shuffle. You can change it via spark.sql.shuffle.partitions
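For example, a sketch of lowering the setting to match your small box (the value 2 here is only illustrative, chosen to match the 2 cores; tune it to your data volume):

```scala
import org.apache.spark.sql.SparkSession

// Option 1: set it when building the session
val spark = SparkSession.builder()
  .appName("stream-join")
  .config("spark.sql.shuffle.partitions", "2") // default is 200
  .getOrCreate()

// Option 2: change it on an already-running session
spark.conf.set("spark.sql.shuffle.partitions", "2")
```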
With the Kafka source, the number of partitions in Spark = the number of partitions in Kafka (on master there is a merged PR that can set the number of Spark partitions to x * Kafka partitions, where you define x - it's not released yet AFAIR)
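So in your case, reading Topic1 (2 Kafka partitions) produces a streaming DataFrame with only 2 Spark partitions. A sketch, assuming a hypothetical broker at localhost:9092; until the x-multiplier option is released, repartition() is the usual way to fan work out across more tasks:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-read").getOrCreate()

// One Spark partition per Kafka partition of Topic1 (2 here)
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("subscribe", "Topic1")
  .load()

// Fan out to more partitions before the join (adds a shuffle)
val fannedOut = df.repartition(4)
```

Note that repartition() itself triggers a shuffle, so whether it helps depends on whether the extra parallelism outweighs the shuffle cost on a 2-core machine.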
Upvotes: 1