Reputation: 1433
If I spin up a Dataproc cluster with one n1-standard-4 master and four n1-standard-4 workers, how do I tell how many partitions are created by default? If I want to make sure I have 32 partitions, what syntax do I use in my PySpark script? I am reading a .csv file from a Google Storage bucket.
Is it simply
myRDD = sc.textFile("gs://PathToFile", 32)
How do I tell how many partitions are being used (from the Dataproc jobs output screen)?
Thanks
Upvotes: 3
Views: 401
Reputation: 2683
To get the number of partitions in an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.getNumPartitions
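A minimal sketch of checking the partition count after reading the file; the `gs://PathToFile` path is the placeholder from the question, and the printed value would appear in the Dataproc job's driver output:

```python
from pyspark import SparkContext

sc = SparkContext()

# Read the CSV from the GCS bucket (placeholder path from the question).
myRDD = sc.textFile("gs://PathToFile")

# Print how many partitions Spark actually created for this RDD.
print(myRDD.getNumPartitions())
```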
To repartition an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.repartition
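And a sketch of forcing 32 partitions, either by repartitioning after the read or by passing the minimum-partitions hint shown in the question (Spark treats that argument as a minimum, not an exact count):

```python
# Repartition to exactly 32 partitions (this triggers a shuffle).
myRDD = myRDD.repartition(32)
print(myRDD.getNumPartitions())

# Alternatively, hint at read time, as in the question:
myRDD = sc.textFile("gs://PathToFile", minPartitions=32)
```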
Upvotes: 3