Rickshaw Tomtom

Reputation: 67

How to run a Java parallel algorithm on Google Dataproc cluster?

I have a simple Java parallel algorithm implemented with Spark, but I am not sure how to run it on a Google Dataproc cluster. I found a lot of resources online that use Python or Scala, but not enough for Java. Here is the code:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Prime {

    List<Integer> primes = new ArrayList<>();

    //Method to calculate and collect the prime numbers below n
    public void countPrime(int n){
        for (int i = 2; i < n; i++){
            boolean isPrime = true;

            //check if the number is prime or not
            for (int j = 2; j < i; j++){
                if (i % j == 0){
                    isPrime = false;
                    break;  // exit the inner for loop
                }
            }

            //add the primes into the List
            if (isPrime){
                primes.add(i);
            }
        }
    }

    //Main method to run the program
    public static void main(String[] args){
        //creating the JavaSparkContext object
        SparkConf conf = new SparkConf().setAppName("haha").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        //new Prime object
        Prime prime = new Prime();
        prime.countPrime(100000);

        //parallelize the collection and count 2 plus the odd primes
        JavaRDD<Integer> rdd = sc.parallelize(prime.primes, 4);
        long count = rdd.filter(e -> e == 2 || e % 2 != 0).count();
    }
}

Upvotes: 2

Views: 259

Answers (1)

Dennis Huo

Reputation: 10687

If you have a jarfile that specifies "Prime" as the main-class already, then at a basic level it's as simple as:

gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jar prime-jarfile.jar

If you have a jarfile that doesn't specify the main-class, you can submit the jarfile as "--jars" (with an 's' at the end) and specify the "--class" instead:

gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jars prime-jarfile.jar --class Prime
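For example, assuming a cluster named my-cluster in region us-central1 (both are placeholders for your own setup; depending on your gcloud version you may also need to pass --region explicitly), the second form might look like:

gcloud dataproc jobs submit spark --cluster my-cluster --region us-central1 --jars prime-jarfile.jar --class Prime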

Note, however, that since you specify setMaster("local"), it overrides the cluster's own Spark environment settings and the job will only run using threads on the master node. You simply need to remove the .setMaster("local") call entirely, and the job will automatically pick up the YARN configuration inside the Dataproc cluster and actually run on multiple worker nodes.
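In other words, the driver setup reduces to something like the following minimal sketch (only the SparkConf line changes; the rest of your main method stays as it is):

// Do not hard-code a master; spark-submit on the Dataproc cluster will
// point the job at YARN, so executors run on the worker nodes.
SparkConf conf = new SparkConf().setAppName("haha");
JavaSparkContext sc = new JavaSparkContext(conf);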

Also, I realize this is just a getting-started exercise, so it probably doesn't matter, but you almost certainly won't see any "speedup" in real distributed mode because:

  1. The computation Spark does per element (a single parity check) is too "cheap" compared to even the time it takes to load an integer.
  2. The number of elements being processed is too small compared to the overhead of starting remote execution.
  3. The number of partitions (4) will probably be too small for dynamic executor allocation to kick in, so they might just end up running mostly one after another.

So you might see more "interesting" results if, instead, each number you parallelize represents a large "range" for a worker to check: for example, the number "0" means "count primes between 0 and 1,000,000", "1" means "count primes between 1,000,000 and 2,000,000", and so on. Then you might have something like:

// Start with an rdd that just parallelizes the numbers 0 through 999 inclusive, with something like 100 to 1000 "slices".
JavaRDD<Integer> countsPerRange = rdd.map(e -> countPrimesInRange(e * 1000000, (e + 1) * 1000000));
int totalCount = countsPerRange.reduce((a, b) -> a + b);
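Putting that idea together, here is a minimal self-contained sketch; the helper name countPrimesInRange, the 1,000,000-wide ranges, the 200 slices, and the class name PrimeRanges are all illustrative choices, not anything the Dataproc API requires:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PrimeRanges {

    // Count the primes in [start, end) with simple trial division, so each
    // range represents a meaningful chunk of CPU work for one task.
    static int countPrimesInRange(int start, int end) {
        int count = 0;
        for (int i = Math.max(start, 2); i < end; i++) {
            boolean isPrime = true;
            for (int j = 2; (long) j * j <= i; j++) {
                if (i % j == 0) {
                    isPrime = false;
                    break;
                }
            }
            if (isPrime) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // No setMaster("local"): let the Dataproc cluster's YARN configuration decide.
        SparkConf conf = new SparkConf().setAppName("prime-ranges");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element is a range id: 0 means [0, 1,000,000), 1 means [1,000,000, 2,000,000), ...
        List<Integer> rangeIds = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rangeIds.add(i);
        }

        // Spread the range ids over a few hundred slices so each partition gets real work.
        JavaRDD<Integer> rdd = sc.parallelize(rangeIds, 200);
        JavaRDD<Integer> countsPerRange = rdd.map(e -> countPrimesInRange(e * 1000000, (e + 1) * 1000000));
        int totalCount = countsPerRange.reduce((a, b) -> a + b);

        System.out.println("Total primes found: " + totalCount);
        sc.stop();
    }
}

Built into a jar and submitted with the same gcloud command as above, each mapped range becomes a task that can be scheduled on a separate executor, so there is enough work per element for the distribution to actually pay off.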

Upvotes: 1
