pgrandjean

Reputation: 766

Spark cluster mode & threads

I am launching a Spark application (2.1.0) in yarn-cluster mode from a gateway node, with the options --master yarn --deploy-mode cluster. However, I see that the process launched by spark-submit on the gateway still creates hundreds of threads locally. Since cluster mode is active, I would expect those threads to be created on a worker node, not on the gateway. The logs confirm that cluster mode is in effect. Why would hundreds of threads be launched on the gateway?
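For reference, the full submit command looks roughly like this (the application jar and main class below are placeholders, not my actual ones):

```shell
# Sketch of the launch command run on the gateway node
# (class and jar names are hypothetical):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```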

PS: I am using an encrypted cluster running Hadoop 2.6.0.

Upvotes: 0

Views: 649

Answers (1)

mauriciojost

Reputation: 370

You may be experiencing the issue reported in https://issues.apache.org/jira/browse/HDFS-7718. The same issue affected us at my company, on a Cloudera cluster with Kerberos enabled, using cluster deploy mode to reduce resource consumption on the node running spark-submit. Launching a few Spark jobs from our gateway node would result in errors like:

java.lang.OutOfMemoryError: Unable to create new native thread

To investigate whether this is indeed the issue affecting you, run jstack against the YARN ApplicationMaster JVM and see what your threads look like. If you see many threads with the following stack trace:

"Truststore reloader thread" daemon prio=10 tid=0x00007fd1a5fa4000 nid=0x46f5 waiting on condition [0x00007fd086eed000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.run(ReloadingX509TrustManager.java:189)
    at java.lang.Thread.run(Thread.java:745)

you are very likely affected by the same issue.
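A quick way to check is to count these reloader threads in a thread dump. A rough sketch (the PID is a placeholder; locate the ApplicationMaster JVM on its NodeManager host first, e.g. with jps or ps):

```shell
# Take a thread dump of the ApplicationMaster JVM and count the
# suspicious truststore reloader threads (PID 12345 is hypothetical):
jstack 12345 | grep -c "Truststore reloader thread"
```

A count that keeps growing as more jars are processed points at this leak.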

In our case, when using spark.yarn.jars on our secured cluster, the ApplicationMaster's thread count would grow by one every time a new jar was analysed for caching in HDFS. Each new thread had the stack trace shown above: hdfs.DFSClient instances created a new KMSClientProvider, which created a new ReloadingX509TrustManager, which in turn spawned a new reloader thread, one per cached jar. A simple workaround that worked for us was to avoid using spark.yarn.jars.
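To illustrate the workaround (spark.yarn.jars is a real Spark property; the HDFS path and jar name below are hypothetical):

```shell
# Problematic in our secured cluster: each jar analysed for caching
# leaked a truststore reloader thread (path is hypothetical):
#   spark-submit --conf spark.yarn.jars=hdfs:///spark/jars/*.jar ...
#
# Workaround: omit spark.yarn.jars entirely and let Spark upload
# its own jars at submit time instead:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my-app.jar
```

The trade-off is a slower submit (Spark re-uploads its jars each time) in exchange for not leaking one thread per cached jar.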

For completeness, you may also want to take a look at https://issues.apache.org/jira/browse/HADOOP-11368.

Upvotes: 1
