smeeb

Reputation: 29557

Query Amazon S3 Object Metadata via Spark

Spark 2.1.x here. I have a Spark cluster configured to read from and write to Amazon S3. I can read JSON successfully like so:

val s3Path = "/mnt/myAwsBucket/some/*.json"
val ds = spark.read.json(s3Path)

So far so good -- if there are multiple JSON files at that location, it reads all of them into a single Dataset. I'm looking to somehow obtain the Last Modified timestamp of each JSON file that I read and store those timestamps in an array of datetimes. Hence, if I'm reading 20 JSON files, I'd end up with an array containing 20 datetimes.

Any idea how I can do this? Looking at the Spark API docs, I don't see any way to query S3 objects for their metadata...

Upvotes: 0

Views: 2042

Answers (1)

Glennie Helles Sindholt

Reputation: 13154

You don't query S3 object metadata through the Spark API, but rather through the AWS S3 SDK. You can do it like this:

import com.amazonaws.services.s3.AmazonS3Client

// getObjectMetadata issues a HEAD request, so the object body is never downloaded
// (getObject would fetch the whole object and leave its stream open);
// getLastModified returns a java.util.Date
val lastModified = new AmazonS3Client().getObjectMetadata("myBucket", "path/to/file").getLastModified

Obviously, you will have to pull in the AWS S3 SDK as a Maven dependency (the com.amazonaws:aws-java-sdk-s3 artifact). Also, I think they may have deprecated the AmazonS3Client constructor in newer versions of the SDK, so you may need to make slight changes depending on which version you download :)
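If you want the array of timestamps the question asks for -- one per JSON file matched by the wildcard -- you can list the objects under the prefix rather than fetching them one at a time. A minimal sketch, assuming SDK 1.11+ (where AmazonS3ClientBuilder replaces the deprecated constructor), and assuming the /mnt/myAwsBucket/some/*.json mount path corresponds to a bucket named myAwsBucket with the files under the some/ prefix:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

// Non-deprecated client construction in SDK 1.11+
val s3 = AmazonS3ClientBuilder.defaultClient()

// Bucket name and prefix here are assumptions based on the question's mount path.
// listObjectsV2 returns up to 1000 keys per call, plenty for ~20 files;
// larger listings would need to page with the continuation token.
val lastModifiedDates: Array[java.util.Date] = s3
  .listObjectsV2("myAwsBucket", "some/")
  .getObjectSummaries
  .asScala
  .filter(_.getKey.endsWith(".json"))
  .map(_.getLastModified)
  .toArray

That gives you one java.util.Date per JSON object, e.g. 20 dates for 20 files.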

Upvotes: 1
