Reputation: 501
Hi all, I'm trying to understand the Databricks Structured Streaming architecture.
Is this architecture diagram relevant for Structured Streaming as well?
If so, here are my questions:
Q1: I see here the concept of reliable receivers. Where do these reliable receivers live: on the driver or on the workers? In other words, does reading from the source happen on the workers or on the driver?
Q2: As we see in the official Spark Streaming diagram, a receiver is a single machine that receives records. So if we have 20 partitions in an Event Hubs source, are we limited by the number of cores on the driver for the maximum number of concurrent reads? In other words, can we only read from the source concurrently, not in parallel?
Q3: Related to Q2, does this mean that parallelism in Structured Streaming can be achieved only for processing?
Below is my version of the architecture; please let me know if it needs any changes.
Upvotes: 0
Views: 145
Reputation: 501
As per my understanding of the Spark Streaming documentation:
Answer to Q1: The receivers live on the worker nodes.
Answer to Q2: Since the receivers run on the workers, the driver's cores do not limit the number of receivers in a cluster. Each receiver occupies a single core, and receivers are allocated to executors in a round-robin fashion.
Answer to Q3: Read parallelism can be achieved by increasing the number of receivers/partitions on the source.
This information is documented here
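To make the Q2 and Q3 answers concrete, here is a minimal sketch in plain Python. It does not call Spark; it only simulates the receiver model described above. The worker names, core counts, and the helper functions `place_receivers` and `cores_left_for_processing` are made up for illustration, and the real scheduler may place receivers differently (for example, when a preferred location is set).

```python
from collections import defaultdict


def place_receivers(num_receivers, executors):
    """Assign receiver ids to executors in round-robin order.

    `executors` maps a (hypothetical) executor name to its core count.
    """
    names = list(executors)
    placement = defaultdict(list)
    for receiver_id in range(num_receivers):
        # Receiver i goes to executor i mod (number of executors).
        placement[names[receiver_id % len(names)]].append(receiver_id)
    return dict(placement)


def cores_left_for_processing(num_receivers, executors):
    """Each receiver pins one core; the rest remain for processing tasks."""
    return sum(executors.values()) - num_receivers


# Assumed cluster: three workers with four cores each (illustrative numbers).
executors = {"worker-1": 4, "worker-2": 4, "worker-3": 4}

placement = place_receivers(5, executors)
cores_left = cores_left_for_processing(5, executors)

print(placement)
print(cores_left)
```

In this toy setup, the five receivers are spread across all three workers, so reads happen in parallel on different machines and the driver's cores play no part. The second function shows the trade-off: every extra receiver takes one core away from processing, so the cluster needs more cores than receivers.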
Please correct me if this is incorrect. Thanks.
Upvotes: 0