Reputation: 2208
I have a large parquet dataset that is written to daily and partitioned by snapshot date (in long form). I am trying to write an app that takes a date and a lookback value as input, and returns a slice of the parquet data from the snapshot date to x days back.
I found a similar question that had an answer suggesting I use
spark.read.parquet("gs://parquet-storage-bucket/parquet-name/snapshot_date=[1564704000-1567123200]")
However, Spark takes this literally and cannot find a parquet with this exact name (obviously).
Is there a way I can provide a start and end date (in long format) and have all partition data within this range retrieved?
Upvotes: 2
Views: 3886
Reputation: 591
You could try filtering the Dataset using the filter function:
import org.apache.spark.sql.functions.col

spark.read.parquet("gs://parquet-storage-bucket/parquet-name")
  .filter(col("snapshot_date") >= 1564704000L && col("snapshot_date") <= 1567123200L)
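If you want the date-plus-lookback behavior described in the question, you could wrap this in a small helper. A minimal sketch, assuming the snapshot_date values are Unix timestamps in seconds; the snapshotSlice name and its signature are just illustrations, not part of any Spark API:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper: slice the dataset from snapshotEpoch back lookbackDays days.
def snapshotSlice(spark: SparkSession, path: String,
                  snapshotEpoch: Long, lookbackDays: Int): DataFrame = {
  val start = snapshotEpoch - lookbackDays * 86400L  // 86400 seconds per day
  spark.read.parquet(path)
    .filter(col("snapshot_date") >= start && col("snapshot_date") <= snapshotEpoch)
}

// e.g. 28 days back from the 1567123200 snapshot:
// val slice = snapshotSlice(spark, "gs://parquet-storage-bucket/parquet-name", 1567123200L, 28)

Since snapshot_date is the partition column, Spark applies partition pruning to this filter, so only the snapshot_date=... directories that fall inside the range are actually read rather than the whole dataset.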
Upvotes: 3