user6022341

How to reliably write and restore partitioned data

I am looking for a way to write and restore a partitioned dataset. For the purpose of this question I can accept both a partitioned RDD:

val partitioner: org.apache.spark.Partitioner = ???
rdd.partitionBy(partitioner)

and a Dataset[Row] / DataFrame:

df.repartition($"someColumn")

The goal is to avoid shuffle when data is restored. For example:

spark.range(n).withColumn("foo", lit(1))
  .repartition(m, $"id")
  .write
  .partitionBy("id")
  .parquet(path)

shouldn't require a shuffle for:

spark.read.parquet(path).repartition(m, $"id")

I thought about writing a partitioned Dataset to Parquet, but I believe that Spark doesn't use this partitioning information when the data is read back.

I can work only with disk storage, not a database or data grid.

Upvotes: 2

Views: 1230

Answers (1)

Igor Berman

This can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch: saving directly to Parquet won't work; only saveAsTable does.

Dataset<Row> parquet = ...;
parquet.write()
  .bucketBy(1000, "col1", "col2")   // hash the rows into 1000 buckets by col1, col2
  .partitionBy("col3")              // directory-level partitioning by col3
  .saveAsTable("tableName");        // the metastore table keeps the bucketing metadata

sparkSession.read().table("tableName");
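
Since the question uses Scala, a rough Scala equivalent of the same approach might look like this (df, spark, the column names and "tableName" are simply carried over from the snippet above):

df.write
  .bucketBy(1000, "col1", "col2")
  .partitionBy("col3")
  .saveAsTable("tableName")

val restored = spark.table("tableName")
// restored.groupBy("col1", "col2").count().explain() is one way to inspect
// whether the plan still contains an Exchange on the bucketed columns.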

Another approach for Spark core is to use a custom RDD, e.g. see https://github.com/apache/spark/pull/4449 - i.e. after reading the RDD from HDFS you essentially set the partitioner back, but it's a bit hacky and not supported natively (so it needs to be adjusted for every Spark version).
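
To make that idea concrete, here is a minimal sketch of such a wrapper RDD (this is not the code from the linked pull request; the class name AssumePartitionedRDD and its shape are my own illustration):

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Re-declares a known partitioner over a parent RDD without moving any data.
// The caller must guarantee that the parent's partitions really are laid out
// according to `part`; Spark will trust the declaration and skip the shuffle.
class AssumePartitionedRDD[K, V](parent: RDD[(K, V)], part: Partitioner)
  extends RDD[(K, V)](parent) {

  require(parent.getNumPartitions == part.numPartitions,
    "number of partitions must match the declared partitioner")

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    parent.iterator(split, context)
}

If the files were not actually written with that partitioner, downstream operations will silently produce wrong results, so this only helps when you fully control how the data was written.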

Upvotes: 3
