Reputation: 1433
If I spin up a Dataproc cluster with one n1-standard-4 master and four n1-standard-4 workers, how do I tell how many partitions are created by default? If I want to make sure I have 32 partitions, what syntax do I use in my PySpark script? I am reading a .csv file from a Google Storage bucket.
Is it simply
myRDD = sc.textFile("gs://PathToFile", 32)
How do I tell how many partitions are being used (from the Dataproc jobs output screen)?
Thanks
Upvotes: 3
Views: 401
Reputation: 2683
To get the number of partitions in an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.getNumPartitions
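A minimal sketch of checking the partition count after reading the file; the `gs://PathToFile` path is the placeholder from the question, and the printed value would appear in the Dataproc job's driver output:

```python
from pyspark import SparkContext

sc = SparkContext()

# Read the CSV from the GCS bucket (placeholder path from the question).
myRDD = sc.textFile("gs://PathToFile")

# Print how many partitions Spark actually created for this RDD.
print(myRDD.getNumPartitions())
```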
To repartition an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.repartition
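And a sketch of forcing 32 partitions, either by repartitioning after the read or by passing the minimum-partitions hint shown in the question (Spark treats that argument as a minimum, not an exact count):

```python
# Repartition to exactly 32 partitions (this triggers a shuffle).
myRDD = myRDD.repartition(32)
print(myRDD.getNumPartitions())

# Alternatively, hint at read time, as in the question:
myRDD = sc.textFile("gs://PathToFile", minPartitions=32)
```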
Upvotes: 3