HITANSU

Reputation: 51

HIVE partition & bucketing support in Spark not working as expected

When working with partitions in S3, Spark lists all the partitions one by one, which is time-consuming. Instead, it should look up the partition in the metastore table and go to that partition directly. I tried an example with 125 partitions. When I compute the exact S3 location by appending the partition column value and access it directly, the query executes within 5 seconds. But if I let Spark figure out the partition itself, it lists all the partitions, which alone takes more than 30 seconds. How can I make Spark resolve the partition from the metastore using predicate pushdown?
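For illustration, here is a minimal sketch of how one might enable metastore-based partition pruning and filter on the partition column; the table name `events` and partition column `dt` are hypothetical placeholders, and the configs assume Spark 2.1+:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  // Let the Hive metastore track partitions instead of listing S3 (Spark 2.1+).
  .config("spark.sql.hive.manageFilesourcePartitions", "true")
  // Push partition-column predicates down to the metastore.
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport()
  .getOrCreate()

// With pruning enabled, a filter on the partition column lets Spark fetch
// only the matching partition's location from the metastore, instead of
// listing every partition directory in S3.
val pruned = spark.sql("SELECT * FROM events WHERE dt = '2018-01-01'")
pruned.show()
```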

Upvotes: 0

Views: 405

Answers (1)

Igor Berman

Reputation: 1532

You need to set up an external Hive metastore (it can be MySQL or PostgreSQL). The table/partition definitions will be persisted there and will survive across different Spark context lifespans.
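For example, here is a minimal sketch of a SparkSession wired to an external metastore; the thrift URI and the `events` table name are placeholders for your own setup:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("external-metastore-example")
  // Point Spark at the external Hive metastore service.
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

// Partition metadata registered in the external metastore persists across
// Spark applications, so predicate pushdown can prune partitions without
// re-listing S3 on every run.
spark.sql("SHOW PARTITIONS events").show()
```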

Upvotes: 1
