Reputation: 51
When working with partitions in S3, Spark lists all the partitions one by one, which is time-consuming. Instead, it should look up the partition in the metastore table and go to that partition directly. I tried an example with 125 partitions. When I compute the exact S3 location by appending the partition column value and access it directly, the query executes within 5 seconds. But if I let Spark figure out the partition itself, it lists all the partitions, which alone takes more than 30 seconds. How can I make Spark resolve the partition from the metastore using predicate pushdown?
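For illustration, a minimal sketch of the three access patterns, assuming a hypothetical Parquet dataset at s3://my-bucket/events/ partitioned by dt; the bucket, table, and partition values are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-access")
  .enableHiveSupport() // table lookups go through the Hive metastore
  .getOrCreate()

// Fast: compute the partition location yourself and read only that prefix.
// No sibling partitions are listed; this is the ~5 second case.
val direct = spark.read.parquet("s3://my-bucket/events/dt=2019-01-01/")

// Slow: reading the root path makes Spark discover (list) every partition
// directory under it before the filter can prune; this is the ~30 second case.
val discovered = spark.read
  .parquet("s3://my-bucket/events/")
  .filter("dt = '2019-01-01'")

// Desired: querying a metastore-backed table lets Spark push the partition
// predicate down to the metastore instead of listing S3.
val pruned = spark.sql("SELECT * FROM events WHERE dt = '2019-01-01'")
```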
Upvotes: 0
Views: 405
Reputation: 1532
You need to set up an external Hive metastore (it can be MySQL or PostgreSQL). The table and partition definitions will then be persisted there and will survive across Spark context lifespans.
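A sketch of what this looks like from the Spark side, assuming a metastore service reachable at thrift://metastore-host:9083 and the same hypothetical events table; the URI, table name, and paths are placeholders, and the metastore connection is usually configured in hive-site.xml rather than in code:

```scala
import org.apache.spark.sql.SparkSession

// Point Spark at the external Hive metastore service. Behind the thrift URI
// sits a MySQL or PostgreSQL database that persists the definitions.
val spark = SparkSession.builder()
  .appName("metastore-pruning")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  // Push partition predicates down to the metastore rather than fetching
  // every partition of the table (already the default in recent Spark).
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport()
  .getOrCreate()

// Register the table once; its partition definitions are then persisted
// in the metastore database and survive Spark context restarts.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION 's3://my-bucket/events/'
""")

// One-time discovery: record the existing partition directories in the metastore.
spark.sql("MSCK REPAIR TABLE events")

// This query now asks the metastore for the matching partition only,
// instead of listing all 125 prefixes in S3.
spark.sql("SELECT count(*) FROM events WHERE dt = '2019-01-01'").show()
```

With the partitions registered there, the WHERE predicate is resolved against the metastore's partition list, so only the matching S3 prefix is ever read.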
Upvotes: 1