iAmHereForHelp
iAmHereForHelp

Reputation: 79

Why can't spark executor communicate with the driver? Results in java.net.UnknownHostException: <driverName>.<namespace>.svc

I am running Spark Submit through a pod within cluster on Kubernetes with the following script:

/opt/spark/bin/spark-submit
--master k8s://someCluster
--deploy-mode cluster
--name someName
--class some.class
--conf spark.driver.userClassPathFirst=true         
--conf spark.kubernetes.namespace=someNamespace
--conf spark.kubernetes.container.image=someImage
--conf spark.kubernetes.container.image.pullSecrets=image-pull-secret
--conf spark.kubernetes.container.image.pullPolicy=Always
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
--conf spark.kubernetes.authenticate.driver.serviceAccountName=someServiceAccount
--conf spark.driver.port=7078
--conf spark.blockManager.port=7079
local:////someApp.jar

The script runs fine and the driver pod starts along with the auto-generated service, with ports 7078, 7079, and 4040 plus selector that matches the label that was added to the driver pod. However, the svc has no endpoints.

The executor then starts but never succeeds due to the following error below:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:283)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:272)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:303)
        at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
        at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
        at scala.collection.immutable.Range.foreach(Range.scala:158)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:301)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        ... 4 more
Caused by: java.io.IOException: Failed to connect to drivername-svc.namespace.svc:7078
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.UnknownHostException: drivername-svc.namespace.svc
        at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:797)
        at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)
        at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)
        at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)
        at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
        at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
        at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
        at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
        at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
        at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
        at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
        at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
        at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:200)
        at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
        at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
        at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
        at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
        at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
        at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:984)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:504)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:417)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:474)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

I also have a network policy which exposes the ports 7078 and 7079 through ingress and egress. Not sure what else I am missing.

Upvotes: 1

Views: 1379

Answers (1)

iAmHereForHelp
iAmHereForHelp

Reputation: 79

Found out the endpoint wasn't added to the service because the driver pod has multiple containers, one of them terminates early causing readiness of the pod to be "not ready"; hence the service does not register the driver pod endpoint. Since there are no endpoints for the service, the executor pods trying to communicate with the service sees no host exception.

Upvotes: 1

Related Questions