Reputation: 469
We need to work with a large partitioned dataset, for efficiency reasons. The data source resides in Hive, but uses different partition criteria. In other words, we need to read the data from Hive into Spark and re-partition it in Spark.
However, Spark reorders/redistributes the data when it is persisted (to either Parquet or ORC), so the new partitioning we build in Spark is lost.
As an alternative, we are considering building our new partitioning in a new Hive table. The question is: is it possible to map Spark partitions from Hive partitions (for reading)?
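For illustration, a minimal sketch of what we are doing (the table, column, and path names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical names throughout; our real table and columns differ.
val spark = SparkSession.builder()
  .appName("hive-repartition")
  .enableHiveSupport()
  .getOrCreate()

// Source table in Hive, partitioned by a criterion we don't want.
val df = spark.table("db.events")

// Re-partition in Spark by the criterion we actually need,
// then persist. partitionBy drives the on-disk directory layout
// from the column values, not from the in-memory partitioning.
df.repartition(col("customer_id"))
  .write
  .partitionBy("customer_id")
  .parquet("/tmp/events_by_customer")
```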
Upvotes: 0
Views: 1117
Reputation: 18013
Partition Discovery might be what you are looking for. From the Spark SQL documentation:
"By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths."
Upvotes: 1