Reputation: 29557
Spark 2.1.x here. I have a Spark cluster configured to read/write to/from Amazon S3. I can accomplish this successfully like so:
val s3Path = "/mnt/myAwsBucket/some/*.json"
val ds = spark.read.json(s3Path)
So far so good -- if there are multiple JSON files at that location it reads all of them into a single Dataset. I'm looking to somehow obtain the Last Modified timestamp on each JSON file that I read and store it in an array of datetimes. Hence if there are 20 JSON files that I'm reading, I'd end up with an array with 20 datetimes inside it.
Any idea how I can do this? Looking at Spark API docs I'm not seeing any way to query S3 objects for their metadata...
Upvotes: 0
Views: 2042
Reputation: 13154
You do not query S3 information through the Spark API, but rather through the AWS S3 SDK. You can do it like this:
import com.amazonaws.services.s3.AmazonS3Client
val lastModified = new AmazonS3Client().getObject("myBucket", "path/to/file").getObjectMetadata.getLastModified
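If you only need the metadata, a cheaper variant is getObjectMetadata, which issues a HEAD request instead of downloading the object body -- a minimal sketch, reusing the same hypothetical bucket and key:

// HEAD request only; does not fetch the object contents
val lastModifiedOnly = new AmazonS3Client().getObjectMetadata("myBucket", "path/to/file").getLastModified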
Obviously, you will have to download the AWS S3 SDK through Maven and include the dependency. Also, I think they may have deprecated AmazonS3Client in newer versions of the SDK, so you may need to make slight changes depending on which version of the SDK you download :)
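To get the array of timestamps the question asks for (one per JSON file), you can list the objects under the prefix and map over their summaries. A minimal sketch, assuming the v1 AWS SDK for Java and a hypothetical bucket/prefix corresponding to whatever backs the /mnt/myAwsBucket mount:

import scala.collection.JavaConverters._
import java.util.Date
import com.amazonaws.services.s3.AmazonS3Client

// Hypothetical bucket and prefix -- substitute the ones behind your /mnt/myAwsBucket mount.
val bucket = "myBucket"
val prefix = "some/"

val s3 = new AmazonS3Client() // deprecated in newer SDK versions, but matches the snippet above

// listObjects returns up to 1000 summaries per page; for larger prefixes you would
// follow the truncation flag, but this keeps the sketch short.
val lastModifiedDates: Array[Date] = s3
  .listObjects(bucket, prefix)
  .getObjectSummaries
  .asScala
  .filter(_.getKey.endsWith(".json"))
  .map(_.getLastModified)
  .toArray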
Upvotes: 1