Reputation: 993
I have an RDD of file names, so an RDD[String]. I get it by parallelizing a list of file names (of files inside HDFS).
Now I map this RDD, and my code opens a Hadoop stream using FileSystem.open(path) and then processes it.
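Roughly, this is the shape of what I run (the HDFS paths and the "processing" below are just placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("manual-hdfs-read"))

    // Placeholder file names; the real list comes from somewhere else.
    val fileNames = Seq("hdfs:///data/part-0001", "hdfs:///data/part-0002")

    val processed = sc.parallelize(fileNames).map { name =>
      // Each task opens its file directly instead of letting Spark plan the read.
      val fs = FileSystem.get(new java.net.URI(name), new Configuration())
      val in = fs.open(new Path(name))
      try {
        // Placeholder "processing": read the whole stream as a string.
        scala.io.Source.fromInputStream(in).mkString
      } finally {
        in.close()
      }
    }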
When I run my job, I look at the Spark UI's Stages page and I see "Locality Level" = "PROCESS_LOCAL" for all the tasks. I don't think Spark could possibly achieve data locality the way I run the job (on a cluster of 4 data nodes), so how is that possible?
Upvotes: 7
Views: 4368
Reputation: 35404
When
FileSystem.open(path)
gets executed in Spark tasks, the file content is loaded into a local variable in the same JVM process, and that is what the RDD partition(s) are built from. So the data locality for that RDD is always PROCESS_LOCAL.
-- vanekjar has already noted this in a comment on the question
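A small illustration of this (my own sketch, not from the comment, using placeholder file names): an RDD built with parallelize() carries no preferred locations, so the scheduler has nothing to co-locate the tasks with.

    val names = sc.parallelize(Seq("hdfs:///data/part-0001", "hdfs:///data/part-0002"))

    names.partitions.foreach { p =>
      // getPreferredLocs returns the locations Spark would try to schedule near.
      // For a parallelized collection it is typically empty, so the scheduler has
      // no locality constraint and the UI reports the tasks as PROCESS_LOCAL.
      println(sc.getPreferredLocs(names, p.index))
    }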
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest:

PROCESS_LOCAL - data is in the same JVM as the running code. This is the best locality possible.
NODE_LOCAL - data is on the same node, e.g. in HDFS on the same node, or in another executor on the same node.
NO_PREF - data is accessed equally quickly from anywhere and has no locality preference.
RACK_LOCAL - data is on the same rack of servers, so it needs to be sent over the network.
ANY - data is elsewhere on the network and not on the same rack.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
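For example (my addition, not part of the quoted guide), how long Spark waits for a better locality level before falling back is controlled by the spark.locality.wait settings:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("locality-wait-example")
      // How long to wait for a better locality level before giving up (default 3s).
      .set("spark.locality.wait", "10s")
      // Per-level overrides also exist, e.g. specifically for NODE_LOCAL:
      .set("spark.locality.wait.node", "10s")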
Upvotes: 6
Reputation: 139
Data locality is one of Spark's features that increases its processing speed. It is covered in the Data Locality section of the Spark tuning guide: Data Locality. When you write sc.textFile("path"), the data locality level is initially determined by the path you specified, but after that Spark tries to move the locality level towards PROCESS_LOCAL to optimize processing speed, by starting the processing at the place where the data is present (locally).
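As a rough sketch (with placeholder paths): when you let Spark plan the read itself, it gets the HDFS block locations of the files and can schedule the tasks where the data is present:

    // Spark uses the HDFS block locations of the matched files as preferred
    // locations, so it can start the processing where the data lives.
    val lines = sc.textFile("hdfs:///data/part-*")

    // Placeholder processing on the lines.
    lines.map(_.toUpperCase).count()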
Upvotes: 2