Reputation: 2372
I am reading the documentation both for Google Cloud Dataproc and for Apache Spark in general, and I am unable to figure out how to manually set the number of partitions when using the BigQuery connector.
The RDD is created using newAPIHadoopRDD, and my strong suspicion is that the partition count can be set via the configuration that is passed to this function. But I can't actually figure out what the possible values for the configuration are. Neither the Spark documentation nor the Google documentation seems to specify or link to the Hadoop job configuration specification.
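For reference, the RDD creation looks roughly like this (a minimal sketch only; the project, dataset, and table ids are placeholders, and BigQueryConfiguration.configureBigQueryInput is assumed to be the connector helper used to populate the Hadoop configuration):

import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("bq-partitions"))
val conf = sc.hadoopConfiguration

// Placeholder project/dataset/table ids.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project")
BigQueryConfiguration.configureBigQueryInput(conf, "my-project:my_dataset.my_table")

// Is there a configuration key here that controls the number of partitions?
val tableRdd = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])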
Is there a way to set the partitions upon the creation of this RDD or do I just need to repartition it as the next step?
Upvotes: 0
Views: 386
Reputation: 68
You need to repartition in your Spark code, for example:
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val REPARTITION_VALUE = 24

val rdd = sc.newAPIHadoopRDD(conf, classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])

// f and g are placeholders for your own record-level and group-level transformations
rdd.map(x => f(x))
  .repartition(REPARTITION_VALUE)   // reset the partition count after the initial map
  .groupBy(_._1)
  .map(tup2 => g(tup2._1, tup2._2.toSeq))
  .repartition(REPARTITION_VALUE)   // groupBy shuffles, so repartition again if needed
And so on ...
When you work with RDDs, you have to manage the partitioning yourself.
Solution: the best approach is to work with the Dataset or DataFrame API.
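If the DataFrame route is an option, a minimal sketch could look like the following (assuming the spark-bigquery connector is on the classpath; the table id is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bq-dataframe-partitions")
  .getOrCreate()

// Controls how many partitions shuffles (groupBy, join, ...) produce
// for DataFrames/Datasets.
spark.conf.set("spark.sql.shuffle.partitions", "24")

// "my-project.my_dataset.my_table" is a placeholder table id.
val df = spark.read
  .format("bigquery")
  .option("table", "my-project.my_dataset.my_table")
  .load()

// You can still set an explicit partition count when the defaults don't fit.
val repartitioned = df.repartition(24)
println(repartitioned.rdd.getNumPartitions)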
Upvotes: 2