Reputation: 2312
I am aware that there are two ways to create an RDD: parallelizing an existing collection in the driver program, or referencing a dataset in an external storage system.
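Roughly like this (a minimal sketch; the file path is only a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Way 1: parallelize an existing collection in the driver program.
rdd1 = sc.parallelize([1, 2, 3, 4])

# Way 2: reference a dataset in external storage (placeholder path).
rdd2 = sc.textFile("data.txt")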
However, I would like to know what happens when I read data from a data lake such as Azure Data Lake Storage Gen 2 (ADLS Gen 2). For instance, if I have the following command:
df = spark.read.csv("path to ADLS Gen 2").rdd
I would like to know how the data is read: is it loaded into the driver, or directly into the worker nodes?
Then, where does the processing happen if we apply a transformation to the DataFrame or RDD? (This question only applies if the data is loaded into the driver node.)
Please note that I am new to Spark and I'm still learning about the tool.
Upvotes: 1
Views: 55
Reputation: 174
The data is read on the worker nodes, unless the program running on the cluster forces the driver node to read it (for example, by calling collect()). Of course, Spark workers don't load the entire RDD into their local memory; which RDD partition goes to which worker is decided by the driver's scheduler.
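A minimal PySpark sketch of the difference (the ADLS path and the default column name _c0 are placeholder assumptions, not values from your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-read-demo").getOrCreate()

# Placeholder ADLS Gen 2 path: substitute your own container and account.
df = spark.read.csv("abfss://container@account.dfs.core.windows.net/data.csv")

# filter() is a transformation: it runs on the workers, each against its
# own partitions; no rows are shipped to the driver here.
filtered = df.filter(df["_c0"].isNotNull())

# count() is an action: the work still happens on the workers, and only
# the final number travels back to the driver.
print(filtered.count())

# collect() is the case where the driver is forced to hold the data:
# every row is pulled into the driver's memory.
rows = filtered.collect()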
This means that when you apply transformations and then trigger an action, Spark takes the following steps:
1. Creates a DAG for computing the transformations and actions in the most efficient way possible (see the sketch after this list).
2. Sends the application code to all active workers of the cluster: a jar file with general info about the program, plus the serialized tasks describing the processing each worker must apply to its partitions.
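To see step 1 concretely, you can ask Spark to print the plan it built before anything runs. A small sketch (path and column name are placeholders, as above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Transformations are lazy: these lines only extend the DAG; no file is
# read yet.
df = spark.read.csv("abfss://container@account.dfs.core.windows.net/data.csv")
result = df.filter(df["_c0"].isNotNull()).select("_c0")

# explain() prints the physical plan the driver derived from the DAG;
# this plan is what gets split into tasks and shipped to the workers.
result.explain()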
The above is a very abstract description, since much more goes on inside a Spark cluster when an application is deployed, but the main idea is that the workers read the files, and the instructions for what to do with them come from the driver over the network.
Upvotes: 1