Reputation: 67
I have a simple Java parallel algorithm implemented using Spark, but I am not sure how to run it on a Google Dataproc cluster. I found a lot of resources online that use Python or Scala, but not much for Java. Here is the code:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Prime {
    List<Integer> primes = new ArrayList<>();

    // Calculate the prime numbers below n and collect them
    public void countPrime(int n) {
        for (int i = 2; i < n; i++) {
            boolean isPrime = true;
            // check whether i is divisible by any smaller number
            for (int j = 2; j < i; j++) {
                if (i % j == 0) {
                    isPrime = false;
                    break; // exit the inner loop
                }
            }
            // add the primes to the list
            if (isPrime) {
                primes.add(i);
            }
        }
    }

    // Main method to run the program
    public static void main(String[] args) {
        // create the JavaSparkContext
        SparkConf conf = new SparkConf().setAppName("haha").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // compute the primes below 100000 on the driver
        Prime prime = new Prime();
        prime.countPrime(100000);
        // parallelize the collection into 4 partitions
        JavaRDD<Integer> rdd = sc.parallelize(prime.primes, 4);
        // count 2 plus the odd primes
        long count = rdd.filter(e -> e == 2 || e % 2 != 0).count();
        System.out.println(count);
        sc.stop();
    }
}
Upvotes: 2
Views: 259
Reputation: 10687
If you have a jarfile that specifies "Prime" as the main-class already, then at a basic level it's as simple as:
gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jar prime-jarfile.jar
If you have a jarfile that doesn't specify the main-class, you can submit the jarfile as "--jars" (with an 's' at the end) and specify the "--class" instead:
gcloud dataproc jobs submit spark --cluster ${CLUSTER_NAME} --jars prime-jarfile.jar --class Prime
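For reference, "specifies Prime as the main-class" means the jar's META-INF/MANIFEST.MF contains a Main-Class entry, e.g.:

```
Main-Class: Prime
```

A jar built this way works with the first, `--jar` form of the command; otherwise use `--jars` together with `--class`.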
Note, however, that since you specify setMaster("local"), that overrides the cluster's own Spark environment settings and the job will only run using threads on the master node. Simply remove the .setMaster("local") call entirely and the job will automatically pick up the YARN configuration within the Dataproc cluster and actually run on multiple worker nodes.
Also, I realize this is just a getting-started exercise so it probably doesn't matter, but you almost certainly won't see any "speedup" in the real distributed mode, because the work per element (a single modulo check) is tiny compared with the overhead of distributing the data and scheduling tasks across the cluster.
So you might see more "interesting" results if, for example, each number you parallelize represents a large "range" for a worker to check; for example, the number "0" means "count primes between 0 and 1,000,000", "1" means "count primes between 1,000,000 and 2,000,000", etc. Then you might have something like:
// Assume rdd parallelizes the numbers 0 through 999 inclusive, with something like 100 to 1000 "slices".
JavaRDD<Integer> countsPerRange = rdd.map(e -> countPrimesInRange(e * 1000000, (e + 1) * 1000000));
int totalCount = countsPerRange.reduce((a, b) -> a + b);
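The countPrimesInRange helper above is hypothetical; a minimal plain-Java sketch of it (trial division, counting primes p with lo <= p &lt; hi; a segmented sieve would be faster for real workloads) could look like:

```java
public class PrimeRange {
    // Hypothetical helper assumed by the snippet above:
    // counts primes p in the half-open range [lo, hi) by trial division.
    public static int countPrimesInRange(int lo, int hi) {
        int count = 0;
        // primes start at 2, so clamp the lower bound
        for (int n = Math.max(lo, 2); n < hi; n++) {
            boolean isPrime = true;
            // only need to test divisors up to sqrt(n)
            for (int d = 2; (long) d * d <= n; d++) {
                if (n % d == 0) {
                    isPrime = false;
                    break;
                }
            }
            if (isPrime) {
                count++;
            }
        }
        return count;
    }
}
```

Because each call does a million-number scan, each Spark task now has enough work to amortize the scheduling overhead.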
Upvotes: 1