Reputation: 538
I have partitioned Parquet data in HDFS, for example: hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23/<part-files.parquet>
I would like to understand which is the best way to read the data:
df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/").where(col('hour') == "23")
OR
df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/hour=23")
I would like to understand which approach is better in terms of performance, and whether there are any other advantages to either one.
Upvotes: 5
Views: 15354
Reputation: 639
Please read through this article. You can take a look at the physical plan of the query. If you see PartitionFilters there, it means the two ways you described are not much different.
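A minimal sketch of how you could check this in PySpark (using the path from the question; the plan output mentioned in the comments is illustrative, not exact):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the parent directory and filter on the partition column
df = (spark.read
      .parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/")
      .where(col("hour") == "23"))

# Print the extended plan; in the FileScan parquet node look for an entry like
# PartitionFilters: [isnotnull(hour#...), (hour#... = 23)]
df.explain(True)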
Upvotes: 0
Reputation: 87069
If you have a big hierarchy of directories/files, reading the single directory directly could be faster than filtering, because Spark needs to build an index of the partitions to apply that filter. See the following question & answer.
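For illustration, a sketch of the direct-read variant from the question. One side effect to be aware of: when you point Spark at a single partition directory, the hour column disappears from the schema unless you also set the basePath option:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = "hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30"

# Direct read: Spark only lists files under hour=23, so there is no
# partition discovery/filtering over the other hours, but the 'hour'
# column is not part of the resulting schema.
df_direct = spark.read.parquet(base + "/hour=23")

# Keep the partition column by telling Spark where the partitioned
# table actually starts.
df_with_hour = (spark.read
                .option("basePath", base)
                .parquet(base + "/hour=23"))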
Upvotes: 1
Reputation: 2003
This is pretty straightforward. The first thing to do when reading a file is to filter out the unnecessary data with df = df.filter(...).
This prunes the data before it is fully loaded into memory; columnar file formats like Parquet and ORC support predicate pushdown (more here), which lets you read only the data you need instead of loading the full dataset and filtering afterwards.
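A minimal sketch of predicate pushdown on an ordinary data column (metric_value is a hypothetical column name, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("hdfs://cluster/stage/data/datawarehouse/table=metrics_data/country=india/year=2020/month=06/day=30/")

# 'metric_value' is a hypothetical column; substitute a real column
# from your schema. The condition should show up under PushedFilters
# in the FileScan parquet node, letting Parquet skip row groups whose
# statistics cannot match.
filtered = df.filter(col("metric_value") > 100)
filtered.explain()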
Upvotes: 2