Immanuel Fredrick

Reputation: 538

spark read parquet with partition filters vs complete path

I have partitioned Parquet data in HDFS, for example: hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23/<part-files.parquet>

I would like to understand which is the best way to read the data:

from pyspark.sql.functions import col

df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/").where(col('hour') == "23")

OR

df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23")

I would like to understand which is better in terms of performance, and whether either approach has other advantages.

Upvotes: 5

Views: 15354

Answers (3)

GodBlessYou

Reputation: 639

Please read through this article. You can take a look at the physical plan of the query: if you find it is using PartitionFilters, that means the two ways you described are not much different.
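As a minimal sketch of how to check this in PySpark (reusing the day-level path from the question), df.explain() prints the physical plan, and a pruned scan shows a non-empty PartitionFilters list in the FileScan node:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

base = "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/"
df = spark.read.parquet(base).where(col("hour") == "23")

# Look for a FileScan node with something like
# PartitionFilters: [isnotnull(hour#...), (hour#... = 23)].
# If it is present, only the hour=23 directories are scanned.
df.explain()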

Upvotes: 0

Alex Ott

Reputation: 87069

If you have a big hierarchy of directories/files, directly reading a single directory could be faster than filtering, because Spark first needs to build an index of all partitions before it can apply the filter. See the following question & answer.
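One caveat with reading the leaf directory directly is that the partition columns above it (hour, day, etc.) disappear from the schema. A sketch of a workaround, assuming the layout from the question: Spark's standard basePath data-source option tells partition discovery where the table root is, so those columns come back even though only one leaf directory is read.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# basePath sets the root for partition discovery, so country/year/month/
# day/hour reappear as columns even when reading a single hour directory.
df = (
    spark.read
    .option("basePath", "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/")
    .parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23")
)
df.printSchema()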

Upvotes: 1

dsk

Reputation: 2003

This is pretty straightforward. The first thing to do while reading a file is to filter out unneeded rows using df = df.filter(...); this prunes the data even before it is read into memory. Advanced file formats like Parquet and ORC support the concept of predicate pushdown (more here), which enables you to read the relevant data much faster than loading the full dataset.
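As a rough illustration (metric_value is a hypothetical non-partition column, not taken from the question), a pushed-down Parquet predicate shows up in the plan's PushedFilters list:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

day_path = "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/"

# metric_value is a hypothetical column used only for illustration.
df = spark.read.parquet(day_path).filter(col("metric_value") > 100.0)

# The FileScan node should list something like
# PushedFilters: [IsNotNull(metric_value), GreaterThan(metric_value,100.0)],
# meaning Parquet row groups whose min/max statistics rule out the
# predicate are skipped without being read.
df.explain()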

Upvotes: 2
