samba

Reputation: 3111

Spark - How to get the latest hour in S3 path?

I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:

val myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/"), where the * wildcards represent year/month/day.

Or I just hardcode it: val myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/")

Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.

If I do val myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.

How is it possible to achieve this in a Databricks notebook without hardcoding the hour?

Upvotes: 0

Views: 309

Answers (1)

justcode

Reputation: 128

Use the datetime module:

from datetime import datetime, timedelta

latest_hour = datetime.now() - timedelta(hours=1)

You can also access the components individually by year, month, day and hour:

latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
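Putting it together, a minimal sketch of building the S3 path for the previous hour, assuming the layout is zero-padded YYYY/MM/DD/HH (the bucket and prefix names here just mirror the question's example):

```python
from datetime import datetime, timedelta

# One hour before the current wall-clock time
latest_hour = datetime.now() - timedelta(hours=1)

# strftime zero-pads month/day/hour, matching a YYYY/MM/DD/HH layout
path = latest_hour.strftime("s3a://data/metrics/%Y/%m/%d/%H/")

# myDf = spark.read.parquet(path)  # read only that hour's partition
```

Note that this derives the hour from the clock, not from what actually exists in S3; if ingestion lags, the computed prefix may be empty, in which case listing the bucket's keys to find the newest hour is the safer approach.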

Upvotes: 1

Related Questions