Reputation:
I am looking for a way to write and restore a partitioned dataset. For the purpose of this question I can accept both a partitioned RDD:

val partitioner: org.apache.spark.Partitioner = ???
rdd.partitionBy(partitioner)

and a Dataset[Row] / DataFrame:

df.repartition($"someColumn")
The goal is to avoid a shuffle when the data is restored. For example:

spark.range(n).withColumn("foo", lit(1))
  .repartition(m, $"id")
  .write
  .partitionBy("id")
  .parquet(path)

shouldn't require a shuffle for:

spark.read.parquet(path).repartition(m, $"id")
I thought about writing a partitioned Dataset to Parquet, but I believe that Spark doesn't use this information. I can only work with disk storage, not a database or data grid.
Upvotes: 2
Views: 1230
Reputation: 1532
This can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch - saving directly to Parquet won't work; only saveAsTable does.
Dataset<Row> parquet = ...;
parquet.write()
  .bucketBy(1000, "col1", "col2")
  .partitionBy("col3")
  .saveAsTable("tableName");

sparkSession.read().table("tableName");
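Adapted to the Scala example from the question, a rough sketch might look like the following (the table name, bucket count, and row count are placeholders; saveAsTable writes a managed table into the session's warehouse directory rather than an arbitrary path):

import org.apache.spark.sql.functions.lit

val n = 1000000L  // placeholder row count
val m = 50        // placeholder bucket count

spark.range(n).withColumn("foo", lit(1))
  .write
  .bucketBy(m, "id")            // rows with the same id land in the same bucket file
  .sortBy("id")
  .saveAsTable("bucketed_ids")  // hypothetical table name

// Later reads go through the catalog; joins and aggregations on "id" can
// then use the bucketing metadata instead of performing a full shuffle.
val restored = spark.read.table("bucketed_ids")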
Another approach, for the Spark core (RDD) API, is to use a custom RDD - see e.g. https://github.com/apache/spark/pull/4449 - i.e. after reading the RDD back from HDFS you essentially set its partitioner again. It's a bit hacky and not supported natively, so it needs to be adjusted for every Spark version.
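A minimal sketch of that idea (the wrapper class is made up, not a Spark API, and it assumes the data on disk really is laid out by the given partitioner - if that assumption is wrong the results will be silently incorrect):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Re-declares a partitioner on an RDD that is assumed to already be
// partitioned accordingly, so downstream joins/reduceByKey skip the shuffle.
class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](prev) {

  require(part.numPartitions == prev.partitions.length,
    "Partitioner must match the existing number of partitions")

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}

Spark never verifies this assumption, so correctness depends entirely on the on-disk layout actually matching the declared partitioner.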
Upvotes: 3