Kumar Roshan Mehta

Reputation: 3306

Apache Spark Correlation only runs on driver

I am new to Spark. I have learned that transformations run on the workers and actions on the driver, but that intermediate aggregation can also happen on the workers (when the operation is commutative and associative), which is what gives the actual parallelism.
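For example, summation is commutative and associative, so partial sums can be computed per partition on the executors before being combined. A minimal PySpark sketch of that pattern (the values and app name here are arbitrary):

from pyspark import SparkContext

sc = SparkContext(appName="PartialAggregation")
rdd = sc.parallelize(range(1, 101), numSlices=4)
# seqOp (first lambda) runs per partition on the executors;
# combOp (second lambda) merges the per-partition partial sums
total = rdd.treeAggregate(0, lambda acc, x: acc + x, lambda a, b: a + b)
print(total)  # 5050
sc.stop()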

I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

How can I tell which parts of the correlation computation happen on the driver and which on the executors?
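From reading those files, my current understanding (which may be wrong) is that RowMatrix accumulates the Gramian, the column sums, and the row count via treeAggregate on the executors, and only the final covariance-to-correlation conversion is done locally on the driver. Here is a small PySpark sketch I wrote to check that understanding; corr_sketch is my own illustrative function, not an MLlib API:

import numpy as np

def corr_sketch(rdd, k):
    # executor side: per-partition accumulation, merged tree-wise
    gram, sums, n = rdd.treeAggregate(
        (np.zeros((k, k)), np.zeros(k), 0),
        lambda acc, row: (acc[0] + np.outer(row, row), acc[1] + row, acc[2] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
    # driver side: cheap local k x k math on the small aggregated result
    mean = sums / n
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)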

Update 1: The setup I use to run the correlation is a cluster consisting of multiple VMs. See here for the images from the Spark web UI: Distributed cross correlation matrix computation

Update 2

I set up my cluster in standalone mode as a 3-node cluster: 1 master/driver (actual machine: a workstation) and 2 VM slaves/executors. I submit the job from the master node like this: ./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py

My correlation sample file is correlations_example.py:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="CorrelationsExample")
data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000), range(20000000, 30000000)]).transpose())
print(Statistics.corr(data, method="pearson"))
sc.stop()
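One thing I also checked is how many partitions the RDD gets, since parallelize without numSlices falls back to the default parallelism (the value 8 below is just an arbitrary example):

arr = np.array([range(10000000), range(10000000, 20000000), range(20000000, 30000000)]).transpose()
data = sc.parallelize(arr, numSlices=8)  # explicit partition count
print(data.getNumPartitions())  # upper bound on tasks running in parallel per stage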

I always get a sequential timeline, like this:

[Image: Spark web UI event timeline showing the stages executing one after another]

Doesn't this mean that it is not happening in parallel, based on the timeline of events? Am I doing something wrong with the job submission, or is the correlation computation in Spark just not parallel?

Update 3: I even tried adding another executor, and I still get the same sequential treeAggregate. I set up the Spark cluster as described here: http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/

Upvotes: 0

Views: 540

Answers (1)

ganeiy

Reputation: 302

Your statement is not entirely accurate. The container [executor] for the driver is launched on the client/edge node or on the cluster, depending on the spark-submit deploy mode, e.g. client or cluster (on YARN). The actions are executed by the workers, and the results are sent back to the driver (e.g. collect).
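As a trivial sketch of that split (assuming an existing SparkContext sc): the map below is executed on the executors, and collect ships the results back to the driver:

rdd = sc.parallelize([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)  # transformation: runs on the executors
print(squared.collect())            # action: results are returned to the driver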

This has been answered already; see the link below for more details: When does an action not run on the driver in Apache Spark?

Upvotes: 0
