Kumar Roshan Mehta

Reputation: 3306

Apache Spark Correlation only runs on driver

I am new to Spark. I have learned that transformations run on the workers and actions on the driver, but that intermediate aggregation can also happen on the workers (when the operation is commutative and associative), which is what gives the actual parallelism.
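For example, summation is commutative and associative, so partial sums can be computed per partition on the executors before being combined. A minimal PySpark sketch of that pattern (the values and app name here are arbitrary):

from pyspark import SparkContext

sc = SparkContext(appName="PartialAggregation")
rdd = sc.parallelize(range(1, 101), numSlices=4)
# seqOp (first lambda) runs per partition on the executors;
# combOp (second lambda) merges the per-partition partial sums
total = rdd.treeAggregate(0, lambda acc, x: acc + x, lambda a, b: a + b)
print(total)  # 5050
sc.stop()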

I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

How can I tell which parts of the correlation computation happen on the driver and which on the executors?
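From reading those files, my current understanding (which may be wrong) is that RowMatrix accumulates the Gramian, the column sums, and the row count via treeAggregate on the executors, and only the final covariance-to-correlation conversion is done locally on the driver. Here is a small PySpark sketch I wrote to check that understanding; corr_sketch is my own illustrative function, not an MLlib API:

import numpy as np

def corr_sketch(rdd, k):
    # executor side: per-partition accumulation, merged tree-wise
    gram, sums, n = rdd.treeAggregate(
        (np.zeros((k, k)), np.zeros(k), 0),
        lambda acc, row: (acc[0] + np.outer(row, row), acc[1] + row, acc[2] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
    # driver side: cheap local k x k math on the small aggregated result
    mean = sums / n
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)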

Update 1: The setup I use to run the correlation is a cluster consisting of multiple VMs. See here for the images from the Spark web UI: Distributed cross correlation matrix computation

Update 2

I set up my cluster in standalone mode as a 3-node cluster: 1 master/driver (actual machine: a workstation) and 2 VM slaves/executors. I submit the job from the master node like this: ./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py

My correlation sample file is correlations_example.py:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="CorrelationsExample")
data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000), range(20000000, 30000000)]).transpose())
print(Statistics.corr(data, method="pearson"))
sc.stop()
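One thing I also checked is how many partitions the RDD gets, since parallelize without numSlices falls back to the default parallelism (the value 8 below is just an arbitrary example):

arr = np.array([range(10000000), range(10000000, 20000000), range(20000000, 30000000)]).transpose()
data = sc.parallelize(arr, numSlices=8)  # explicit partition count
print(data.getNumPartitions())  # upper bound on tasks running in parallel per stage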

I always get a sequential timeline, like this:

[Image: Spark web UI event timeline showing the stages executing one after another]

Doesn't this mean that it is not happening in parallel, based on the timeline of events? Am I doing something wrong with the job submission, or is the correlation computation in Spark just not parallel?

Update 3: I even tried adding another executor, and I still get the same sequential treeAggregate. I set up the Spark cluster as described here: http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/

Upvotes: 0

Views: 540

Answers (1)

ganeiy

Reputation: 302

Your statement is not entirely accurate. The container [executor] for the driver is launched on the client/edge node or on the cluster, depending on the spark-submit deploy mode, e.g. client or cluster (on YARN). The actions are executed by the workers, and the results are sent back to the driver (e.g. collect).
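As a trivial sketch of that split (assuming an existing SparkContext sc): the map below is executed on the executors, and collect ships the results back to the driver:

rdd = sc.parallelize([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)  # transformation: runs on the executors
print(squared.collect())            # action: results are returned to the driver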

This has been answered already; see the link below for more details: When does an action not run on the driver in Apache Spark?

Upvotes: 0
