Reputation: 45692
Could you please help me find a Java API for repartitioning the sales dataset into N partitions of equal size? By equal size I mean an equal number of rows.
Dataset<Row> sales = sparkSession.read().parquet(salesPath);
sales.toJavaRDD().partitions().size(); // returns 1
Upvotes: 1
Views: 3998
Reputation: 412
AFAIK, custom partitioners are not supported for Datasets. The whole idea of the Dataset and DataFrame APIs in Spark 2+ is to abstract away the need to meddle with custom partitioners. So if we have to deal with data skew and reach the point where a custom partitioner is the only option, I guess we would drop down to lower-level RDD manipulation.
For example, see the Facebook use-case study and the Spark Summit talk related to it.
Defining partitioners for RDDs is well documented in the API docs.
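To make the RDD route a bit more concrete, here is a minimal sketch of just the arithmetic such a partitioner could use to spread rows evenly. It has no Spark dependency: the class `EqualSizePartitioning` and its methods are hypothetical names of mine, not part of any Spark API. The assumption is that each row has first been given a 0-based index (for example with `zipWithIndex()` on the RDD) and that the total row count is known.

```java
// Sketch only: the row-to-partition arithmetic, with no Spark dependency.
// EqualSizePartitioning is a hypothetical helper, not a Spark class.
final class EqualSizePartitioning {

    // Maps a 0-based row index to a partition id in [0, numPartitions).
    // Partition sizes produced this way differ by at most one row.
    static int partitionFor(long rowIndex, long totalRows, int numPartitions) {
        if (numPartitions <= 0 || totalRows <= 0) {
            throw new IllegalArgumentException("totalRows and numPartitions must be positive");
        }
        if (rowIndex < 0 || rowIndex >= totalRows) {
            throw new IllegalArgumentException("rowIndex out of range: " + rowIndex);
        }
        // multiplyExact fails loudly instead of silently overflowing on huge inputs.
        return (int) (Math.multiplyExact(rowIndex, (long) numPartitions) / totalRows);
    }

    // Counts how many rows land in each partition (for checking the balance).
    static long[] partitionSizes(long totalRows, int numPartitions) {
        long[] sizes = new long[numPartitions];
        for (long i = 0; i < totalRows; i++) {
            sizes[partitionFor(i, totalRows, numPartitions)]++;
        }
        return sizes;
    }
}
```

In Spark, I would expect this logic to live in the `getPartition(Object key)` method of a subclass of `org.apache.spark.Partitioner`, applied with `partitionBy` on a pair RDD keyed by the row index. I have not run that wiring against the dataset in the question, so treat it as a starting point rather than a tested solution.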
Upvotes: 3