Parsing files from Amazon S3 with Apache Spark

Question

I am using Apache Spark and need to parse files stored in Amazon S3. How can I determine each file's extension while fetching the files from an S3 path?

freedev · Accepted Answer

I suggest following the Cloudera tutorial "Accessing Data Stored in Amazon S3 through Spark".

To access data stored in Amazon S3 from Spark applications, you could use Hadoop file APIs (SparkContext.hadoopFile, JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs, providing URLs of the form s3a://bucket_name/path/to/file.txt.

You can read and write Spark SQL DataFrames using the Data Source API.
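As a rough sketch of both approaches in PySpark: the helper names below (s3a_url, read_lines_rdd, read_csv_dataframe) are hypothetical, the RDD example uses the textFile convenience call rather than the hadoopFile calls listed above, and it assumes the cluster already has the S3A connector (hadoop-aws) and AWS credentials configured.

```python
def s3a_url(bucket, key):
    """Build a URL of the form s3a://bucket_name/path/to/file.txt."""
    return "s3a://{}/{}".format(bucket, key.lstrip("/"))


def read_lines_rdd(spark, url):
    """Read a text file from S3 as an RDD of lines (RDD API).

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.sparkContext.textFile(url)


def read_csv_dataframe(spark, url):
    """Read a CSV file from S3 as a DataFrame (Data Source API).

    Assumes the file has a header row; drop the option otherwise.
    """
    return spark.read.format("csv").option("header", "true").load(url)
```

With a session created via SparkSession.builder.getOrCreate(), you would call read_lines_rdd(spark, s3a_url("bucket_name", "path/to/file.txt")).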

Regarding the file extension, there are a few options. The simplest is to take the extension from the filename (e.g. txt from file.txt).
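Taking the extension from the filename could look like this in plain Python. s3_extension is a hypothetical helper name, and the sketch assumes object keys do not contain "?" or "#", which urlparse would treat as query or fragment separators.

```python
import posixpath
from urllib.parse import urlparse


def s3_extension(url):
    """Return the lowercase extension of an S3 URL without the dot, or '' if none."""
    # For s3a://bucket_name/path/to/file.txt the path part is /path/to/file.txt
    key = urlparse(url).path
    _, ext = posixpath.splitext(key)
    return ext[1:].lower()
```

In Spark, this can be applied to the paths returned alongside the data, for instance the first element of each (path, content) pair produced by SparkContext.wholeTextFiles.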

If the extensions have been stripped from the files stored in your S3 bucket, you can still determine the content type from the metadata stored with each S3 object, which a HEAD Object request returns:

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
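A minimal sketch of that metadata lookup, assuming the boto3 library (not mentioned above), whose client method head_object issues the HEAD Object request. object_content_type is a hypothetical helper name, and the client is passed in so the function itself has no dependencies.

```python
def object_content_type(s3_client, bucket, key):
    """Return the Content-Type recorded for an S3 object, or None if absent.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # HEAD Object returns only the object's metadata, not its body
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response.get("ContentType")
```

Note that the value is whatever was set when the object was uploaded, so it is only as reliable as the uploader; objects uploaded without an explicit type typically report a generic binary type.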
