Reputation:
I am looking for a way to write and restore a partitioned dataset. For the purpose of this question I can accept both a partitioned RDD:

val partitioner: org.apache.spark.Partitioner = ???
rdd.partitionBy(partitioner)

and a Dataset[Row] / DataFrame:

df.repartition($"someColumn")
The goal is to avoid a shuffle when the data is restored. For example:

spark.range(n).withColumn("foo", lit(1))
  .repartition(m, $"id")
  .write
  .partitionBy("id")
  .parquet(path)

shouldn't require a shuffle for:

spark.read.parquet(path).repartition(m, $"id")
I thought about writing a partitioned Dataset to Parquet, but I believe that Spark doesn't use this information. I can only work with disk storage, not a database or data grid.
Upvotes: 2
Views: 1230
Reputation: 1532
This can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch - saving directly to Parquet won't work; only saveAsTable does.
Dataset<Row> parquet = ...;
parquet.write()
  .bucketBy(1000, "col1", "col2")
  .partitionBy("col3")
  .saveAsTable("tableName");

sparkSession.read().table("tableName");
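Adapted to the Scala example from the question, a rough sketch might look like the following (the table name, bucket count, and row count are placeholders; saveAsTable writes a managed table into the session's warehouse directory rather than an arbitrary path):

import org.apache.spark.sql.functions.lit

val n = 1000000L  // placeholder row count
val m = 50        // placeholder bucket count

spark.range(n).withColumn("foo", lit(1))
  .write
  .bucketBy(m, "id")            // rows with the same id land in the same bucket file
  .sortBy("id")
  .saveAsTable("bucketed_ids")  // hypothetical table name

// Later reads go through the catalog; joins and aggregations on "id" can
// then use the bucketing metadata instead of performing a full shuffle.
val restored = spark.read.table("bucketed_ids")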
Another approach, for the Spark core (RDD) API, is to use a custom RDD - see e.g. https://github.com/apache/spark/pull/4449 - i.e. after reading the RDD back from HDFS you essentially set its partitioner again. It's a bit hacky and not supported natively, so it needs to be adjusted for every Spark version.
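A minimal sketch of that idea (the wrapper class is made up, not a Spark API, and it assumes the data on disk really is laid out by the given partitioner - if that assumption is wrong the results will be silently incorrect):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Re-declares a partitioner on an RDD that is assumed to already be
// partitioned accordingly, so downstream joins/reduceByKey skip the shuffle.
class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](prev) {

  require(part.numPartitions == prev.partitions.length,
    "Partitioner must match the existing number of partitions")

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}

Spark never verifies this assumption, so correctness depends entirely on the on-disk layout actually matching the declared partitioner.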
Upvotes: 3