Reputation: 326
I'm running a Spark job on YARN and would like to get the YARN container ID (as part of a requirement to generate unique IDs across a set of Spark jobs). I can see the Container.getId() method to get the ContainerId, but I have no idea how to get a reference to the currently running container from YARN. Is this even possible? How does a YARN container get its own information?
Upvotes: 2
Views: 2301
Reputation: 654
YARN will export all of the environment variables listed here: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java#L117
So you should be able to access it like this:
import org.apache.hadoop.yarn.api.ApplicationConstants

sys.env.get(ApplicationConstants.Environment.CONTAINER_ID.toString)
// or, equivalently, using the raw variable name
sys.env.get("CONTAINER_ID")
Upvotes: 0
Reputation: 1409
Here is a description of how Spark stores the container ID.
Spark hides the container ID and exposes the executor ID per application/job. So if you are planning to maintain a unique ID per Spark job, my suggestion is to use the application ID that Spark gives you, and then append your own string to make it unique. A sketch of this follows the code below.
Below is the relevant Spark code from "YarnAllocator.scala":
private[yarn] val executorIdToContainer = new HashMap[String, Container]
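As a rough sketch of that suggestion (assuming a spark-shell on YARN; the format of the combined ID is my own choice, not anything Spark prescribes), you can combine the application ID from the driver with the executor ID read inside a task:
// On the driver: the YARN application ID, e.g. "application_1490000000000_0001"
val appId = sc.applicationId

// On the executors: combine it with each executor's ID
val ids = sc.parallelize(1 to 10)
  .map(_ => s"$appId-executor-${org.apache.spark.SparkEnv.get.executorId}")
  .distinct
  .collect()
ids.foreach(println)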
Upvotes: 1
Reputation: 326
The only way that I could get at the container ID was through the log directory. The following works in a spark-shell.
import org.apache.hadoop.yarn.api.records.ContainerId

def f(): String = {
  // Spark sets this system property in each YARN executor JVM; it points at a
  // directory whose last path component is the container ID string
  // (e.g. ".../application_.../container_..."). It is not set in local mode.
  val localLogDir: String = System.getProperty("spark.yarn.app.container.log.dir")
  val containerIdString: String = localLogDir.split("/").last
  // Parse the string form and extract the numeric container ID
  val containerIdLong: Long = ContainerId.fromString(containerIdString).getContainerId
  containerIdLong.toHexString
}

// Run f() on the executors and print the distinct container IDs
val rdd1 = sc.parallelize(1 to 10).map(_ => f())
rdd1.distinct.collect().foreach(println)
Upvotes: 3