Reputation: 3100
When we create an RDD with the textFile function on HDFS, the partitions are created according to the HDFS blocks, and computation generally happens on the data nodes where the data resides.
However, when we create an RDD from S3 files, how is the data transferred from the S3 bucket to the Spark workers for execution? Does the transfer involve the driver as well? And are there any performance implications of using S3 as storage compared to HDFS?
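To illustrate, here is a minimal sketch of the two cases I mean (the paths, bucket name, and local SparkContext setup are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder setup; in a real application the context
// would come from the cluster deployment.
val conf = new SparkConf().setAppName("textFile-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// HDFS: partitions follow the HDFS blocks, and tasks are
// scheduled on the data nodes holding those blocks.
val hdfsRdd = sc.textFile("hdfs:///data/input.txt")

// S3 (via the s3a connector): same API, but there are no
// local blocks -- how does the data reach the workers?
val s3Rdd = sc.textFile("s3a://my-bucket/data/input.txt")

println(s"HDFS partitions: ${hdfsRdd.getNumPartitions}")
println(s"S3 partitions:   ${s3Rdd.getNumPartitions}")
```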
Regards,
Neeraj
Upvotes: 2
Views: 571
Reputation: 18003
As you imply, there is no data locality with S3. The Workers/Executors read the data directly from S3 over the network; all you need is a splittable format so the input can be divided into partitions. Hence S3 is slower than HDFS, but cheaper, and no NameNode is required.
The Driver is not part of the transfer; it only coordinates tasks to the Workers/Executors and receives results for actions like collect. Funnelling the data through the Driver would not make sense architecturally. See the sketch below.
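A minimal sketch of reading from S3 with the s3a connector (the bucket name and credential settings are placeholder assumptions; in practice credentials usually come from the environment or an instance profile):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("s3-read-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// Hadoop S3A configuration; each Executor uses it to open its
// own connection to S3 -- the Driver does not relay the data.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY") // placeholder
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY") // placeholder

// Plain text is splittable, so Spark divides the input into
// partitions and each task streams its own byte range from S3.
val rdd = sc.textFile("s3a://my-bucket/logs/*.txt") // hypothetical bucket

// Only an action like count/collect pulls results back to the Driver.
println(rdd.count())
```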
Upvotes: 4