Cam

Reputation: 2208

How do I read a certain date range from a partitioned parquet file in Spark

I have a large parquet file that is written to daily and partitioned by snapshot date (stored as a long). I am trying to write an app that takes a date and a lookback value as input and returns a slice of the parquet from the snapshot day to x days back.

I found a similar question that had an answer suggesting I use

spark.read.parquet("gs://parquet-storage-bucket/parquet-name/snapshot_date=[1564704000-1567123200]")

However, Spark seems to take this literally and cannot find a parquet with this exact name (obviously).

Is there a way I can provide a start and end date (in long format) and have all partition data within that range retrieved?

Upvotes: 2

Views: 3886

Answers (1)

Pritish

Reputation: 591

You could try filtering the Dataset with the filter function:

import org.apache.spark.sql.functions.col

spark.read.parquet("gs://parquet-storage-bucket/parquet-name")
  .filter(col("snapshot_date") >= 1564704000L && col("snapshot_date") <= 1567123200L)
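Because snapshot_date is the partition column, Spark should push this predicate down and prune the non-matching partitions rather than scan the whole dataset. As a minimal sketch of the app described in the question (the readWindow helper name and the day-aligned epoch-seconds arithmetic are my assumptions, not from the original post), you could derive the bounds from a snapshot date and a lookback value:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper: read the slice of the parquet covering
// `snapshotDate` (day-aligned epoch seconds) back through `lookbackDays` days.
def readWindow(spark: SparkSession,
               path: String,
               snapshotDate: Long,
               lookbackDays: Int): DataFrame = {
  val secondsPerDay = 86400L
  val start = snapshotDate - lookbackDays * secondsPerDay
  spark.read.parquet(path)
    .filter(col("snapshot_date") >= start && col("snapshot_date") <= snapshotDate)
}

// Example: the 30 days ending at snapshot 1567123200
// val df = readWindow(spark, "gs://parquet-storage-bucket/parquet-name", 1567123200L, 30)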

Upvotes: 3
