Cam

Reputation: 2208

How do I read a certain date range from a partitioned parquet file in Spark

I have a large parquet file that is written to daily and partitioned by snapshot date (stored as a long). I am trying to write an app that takes a date and a lookback value as input and returns a slice of the parquet from the snapshot day to x days back.

I found a similar question that had an answer suggesting I use

spark.read.parquet("gs://parquet-storage-bucket/parquet-name/snapshot_date=[1564704000-1567123200]")

However, Spark seems to take this literally and cannot find a parquet with this exact name (obviously).

Is there a way I can provide a start and end date (in long format) and have all partition data within that range retrieved?

Upvotes: 2

Views: 3886

Answers (1)

Pritish

Reputation: 591

You could try filtering the Dataset with the filter function:

import org.apache.spark.sql.functions.col

spark.read.parquet("gs://parquet-storage-bucket/parquet-name")
  .filter(col("snapshot_date") >= 1564704000L && col("snapshot_date") <= 1567123200L)
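Because snapshot_date is the partition column, Spark should push this predicate down and prune the non-matching partitions rather than scan the whole dataset. As a minimal sketch of the app described in the question (the readWindow helper name and the day-aligned epoch-seconds arithmetic are my assumptions, not from the original post), you could derive the bounds from a snapshot date and a lookback value:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper: read the slice of the parquet covering
// `snapshotDate` (day-aligned epoch seconds) back through `lookbackDays` days.
def readWindow(spark: SparkSession,
               path: String,
               snapshotDate: Long,
               lookbackDays: Int): DataFrame = {
  val secondsPerDay = 86400L
  val start = snapshotDate - lookbackDays * secondsPerDay
  spark.read.parquet(path)
    .filter(col("snapshot_date") >= start && col("snapshot_date") <= snapshotDate)
}

// Example: the 30 days ending at snapshot 1567123200
// val df = readWindow(spark, "gs://parquet-storage-bucket/parquet-name", 1567123200L, 30)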

Upvotes: 3
