Reputation: 537
I have a job that needs to access Parquet files on HDFS, and I would like to minimise network activity. So far I have the HDFS DataNodes and Spark workers running on the same nodes, but when I launch my job the data locality is always ANY, where it should be NODE_LOCAL since the data is distributed across all the nodes.
Is there any option I should configure to tell Spark to start the tasks where the data is?
Upvotes: 2
Views: 623
Reputation: 380
The property you are looking for is spark.locality.wait. If you increase its value, Spark will schedule tasks more locally, since it won't ship data to another worker just because the worker holding the data is busy. However, setting the value too high can result in longer execution times, because you do not utilise the workers efficiently.
Also have a look here: http://spark.apache.org/docs/latest/configuration.html
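For example, here is a minimal sketch of setting the property when building the session (the 10s wait and the HDFS path are illustrative placeholders, not values from the question):

import org.apache.spark.sql.SparkSession

object LocalityWaitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("locality-wait-example")
      // Wait up to 10s for a NODE_LOCAL slot before falling back to a less local level.
      .config("spark.locality.wait", "10s")
      .getOrCreate()

    // Read Parquet from HDFS; the path below is a placeholder.
    val df = spark.read.parquet("hdfs:///data/my_table")
    println(df.count())

    spark.stop()
  }
}

You can also pass it on the command line with spark-submit --conf spark.locality.wait=10s, so you can tune it per job without changing code.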
Upvotes: 3