peleitor

Reputation: 469

Hive partitions to Spark partitions

For efficiency reasons, we need to work with a large, partitioned dataset. The source data resides in Hive, but under different partition criteria. In other words, we need to read the data from Hive into Spark and re-partition it in Spark.

However, Spark reorders/redistributes the partitioning when the data is persisted (to either Parquet or ORC), so our new Spark partitioning is lost.

As an alternative, we are considering building the new partitioning in a new Hive table. The question is: is it possible to derive Spark partitions from Hive partitions when reading?
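For illustration, here is a minimal PySpark sketch of the read-then-re-partition step described above. The table name source_db.events, the partition column customer_id, and the output path are all hypothetical placeholders:

    from pyspark.sql import SparkSession

    # Hive support is required so spark.table() can resolve metastore tables.
    spark = (SparkSession.builder
             .appName("repartition-hive-data")
             .enableHiveSupport()
             .getOrCreate())

    # Read the Hive table; its original partition layout does not matter here.
    df = spark.table("source_db.events")  # hypothetical table name

    # Re-partition in memory on the new criteria, then persist with partitionBy,
    # which writes one directory per distinct customer_id value.
    (df.repartition("customer_id")        # hypothetical partition column
       .write
       .mode("overwrite")
       .partitionBy("customer_id")
       .parquet("/warehouse/events_by_customer"))  # hypothetical output path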

Upvotes: 0

Views: 1117

Answers (1)

Ged

Reputation: 18013

Partition Discovery might be what you are looking for:

" Passing the path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. "

Upvotes: 1
