Thom Rogers

Reputation: 1433

How to specify/check # of partitions on Dataproc cluster

If I spin up a Dataproc cluster with 1 n1-standard-4 master and 4 n1-standard-4 workers, how do I tell how many partitions are created by default? If I want to make sure I have 32 partitions, what syntax do I use in my PySpark script? I am reading a .csv file from a Google Cloud Storage bucket.

Is it simply

myRDD = sc.textFile("gs://PathToFile", 32)

How do I tell how many partitions were actually created (e.g., using the Dataproc jobs output screen)?
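
For context, here is a minimal sketch of what I have in mind, assuming textFile's second argument is treated as a minimum partition count (the gs:// path is the same placeholder as above):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# The second argument to textFile is a *minimum* number of partitions;
# Spark may split a large file into more than this.
myRDD = sc.textFile("gs://PathToFile", minPartitions=32)

# Report how many partitions were actually created.
print(myRDD.getNumPartitions())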

Thanks

Upvotes: 3

Views: 401

Answers (1)
