Running a Spark application using HDFS or S3

Question

In my Spark application, I just want to read one large file and distribute the computation across many nodes on EC2.

Initially, my file is stored on S3.

It is very convenient for me to load the file directly from S3 with the sc.textFile() function.

However, I could put in some effort to copy the data into HDFS first and then read it from there.

My question is, will the performance be better with HDFS?

My code works on Spark partitions (the mapPartitions transformation), so does it really matter which file system the data is initially read from?
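
A minimal sketch of the setup in question, assuming PySpark, may make the comparison concrete. The bucket name, NameNode address, file paths, and the helper names count_lines_in_partition and run are hypothetical placeholders, and the s3a:// scheme assumes the hadoop-aws connector is available on the cluster.

```python
def count_lines_in_partition(lines):
    """Per-partition function: receives an iterator over the lines of one
    partition and yields a single line count for that partition."""
    count = 0
    for _ in lines:
        count += 1
    yield count


def run(input_uri):
    """Load a text file from the given URI and sum the per-partition counts."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.textFile(input_uri)  # same call for s3a:// and hdfs:// URIs
    per_partition = rdd.mapPartitions(count_lines_in_partition).collect()
    return sum(per_partition)


# Hypothetical locations (assumptions, not real paths).
S3_URI = "s3a://my-bucket/data/big-file.txt"
HDFS_URI = "hdfs://namenode:8020/data/big-file.txt"
```

Calling run(S3_URI) or run(HDFS_URI) executes the same mapPartitions code; only the input URI passed to sc.textFile() differs between the two cases.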

Answers (1)