Reputation: 67
I need to test connectivity from each node in a Spark cluster to a database. Is there a way I can do it programmatically? I have thought of getting the hostnames of all the nodes in a Spark cluster as follows -
val data = sc.parallelize(1 to 100000000)
val hosts = data.map { value =>
  java.net.InetAddress.getLocalHost().getHostName()
}.collect()
println(s"Hostnames: ${hosts.mkString(", ")}")
But this always prints a single hostname even though there are many nodes in the cluster. I am running the program in a cluster mode configuration. If I am able to get different hostnames, then I can assume that this code is running on different nodes, and in the same way I can test connectivity by modifying it as follows -
import java.sql.DriverManager

val data = sc.parallelize(1 to 100000000)
val hosts = data.map { value =>
  Class.forName(DRIVER)
  val connection = DriverManager.getConnection(URL, USER, PASSWORD)
  connection.close() // avoid leaking a connection per element
  java.net.InetAddress.getLocalHost().getHostName()
}.collect()
println(s"Hostnames: ${hosts.mkString(", ")}")
Or is there any alternative way to do this?
Upvotes: 3
Views: 1038
Reputation: 1080
Your data might all be sitting on a single node, hence you may get the same hostname every time. Try increasing the number of partitions of your data; doing this will distribute it across the cluster. Still, there is no guarantee that all executors will be used. A common rule of thumb for choosing the number of partitions is:
number of partitions = 3 * number of CPUs in your cluster
This will most likely distribute your data, so when you run the code it will execute on different executors.
To increase the number of partitions, you can use:
data.repartition(<number of partitions>)
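For example, a minimal sketch of applying that formula (assuming sc.defaultParallelism as a rough stand-in for the number of CPUs in your cluster, which it typically reflects on standalone and YARN deployments):

// Derive the partition count from the rule of thumb above.
// sc.defaultParallelism usually equals the total cores granted to the application.
val numPartitions = 3 * sc.defaultParallelism
val distributed = data.repartition(numPartitions)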
You could also use the mapPartitions function instead of map if all you need is the hostname (and one connection attempt) per partition, as in the sketch below.
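For instance, a minimal sketch of that approach (reusing the placeholder DRIVER, URL, USER and PASSWORD constants from your question, and distributed from the snippet above), which opens one connection per partition instead of one per element:

import java.net.InetAddress
import java.sql.DriverManager
import scala.util.Try

val results = distributed.mapPartitions { _ =>
  val host = InetAddress.getLocalHost.getHostName
  // One connection attempt per partition, not per element
  val ok = Try {
    Class.forName(DRIVER)
    DriverManager.getConnection(URL, USER, PASSWORD).close()
  }.isSuccess
  Iterator((host, ok))
}.collect().distinct

results.foreach { case (host, ok) =>
  println(s"$host -> ${if (ok) "connected" else "FAILED"}")
}

collect() here only brings back one small tuple per partition, so it is safe on the driver even for a large RDD.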
Upvotes: 2