Reputation: 537
I have a job that needs to access Parquet files on HDFS, and I would like to minimise network activity. So far I have the HDFS DataNodes and Spark workers running on the same nodes, but when I launch my job the data locality is always ANY, where it should be NODE_LOCAL since the data is distributed across all the nodes.
Is there any option I should configure to tell Spark to start the tasks where the data is?
Upvotes: 2
Views: 623
Reputation: 380
The property you are looking for is spark.locality.wait. If you increase its value, Spark will schedule tasks more locally, since it won't ship data to another worker just because the worker holding the data is busy. However, setting the value too high can result in longer execution times, because you do not utilise the workers efficiently.
Also have a look here: http://spark.apache.org/docs/latest/configuration.html
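For example, here is a minimal sketch of setting the property when building the session (the 10s wait and the HDFS path are illustrative placeholders, not values from the question):

import org.apache.spark.sql.SparkSession

object LocalityWaitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("locality-wait-example")
      // Wait up to 10s for a NODE_LOCAL slot before falling back to a less local level.
      .config("spark.locality.wait", "10s")
      .getOrCreate()

    // Read Parquet from HDFS; the path below is a placeholder.
    val df = spark.read.parquet("hdfs:///data/my_table")
    println(df.count())

    spark.stop()
  }
}

You can also pass it on the command line with spark-submit --conf spark.locality.wait=10s, so you can tune it per job without changing code.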
Upvotes: 3