Max

Reputation: 15965

Does Hadoop's TaskTracker spawn a new JVM for each task?

According to the TaskTracker Hadoop Wiki page, the TaskTracker spawns a new JVM to do the actual work that it is tracking. There is a typo in the page, however, and it is not clear whether the TaskTracker spawns one JVM for all the tasks it is tracking, or one JVM for each task it is tracking. The reason I am asking is that I am curious whether using static variables to hold job-level values provides any benefit over simply instantiating a variable in the map function.

Upvotes: 0

Views: 567

Answers (1)

Donald Miner

Reputation: 39903

It spawns one JVM for each task.

You can reuse JVMs by setting the configuration parameter mapred.job.reuse.jvm.num.tasks, but that just reduces JVM spin-up time. Functionally, it will still rebuild the classes, so that doesn't matter for your case.
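For reference, a minimal sketch of how that knob might be set when building a job configuration, assuming the classic mapred API of Hadoop 1.x (the wrapper class name is hypothetical); a value of -1 means one JVM is reused for an unlimited number of the job's tasks:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseConfig {
        public static JobConf configure() {
            JobConf conf = new JobConf(JvmReuseConfig.class);
            // -1 = reuse one JVM for an unlimited number of this job's tasks;
            // the default of 1 spawns a fresh JVM per task.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            // JobConf also exposes a convenience setter for the same property.
            conf.setNumTasksToExecutePerJvm(-1);
            return conf;
        }
    }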

If the variable is relatively small, like a string or something, you shouldn't be too worried. If it's larger, you can start being worried! For example, loading a large file from the distributed cache into a Map once per task can be expensive in aggregate (see the sketch below). You can mitigate this by having fewer map tasks that each do more work per task. I've even done crazy things like store shared variables in Redis or ZooKeeper.
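A hedged sketch of that per-task loading pattern, assuming the newer mapreduce API of Hadoop 1.x; the file name lookup.txt and the field names are hypothetical. Because each task runs in its own JVM, the static map is effectively per-task state, populated once in setup() rather than once per record:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Each task gets its own JVM, so this static map only pays off across
        // records of the same task (or across tasks when JVM reuse is enabled).
        private static Map<String, String> lookup = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            if (lookup.isEmpty()) {
                // "lookup.txt" is assumed to have been shipped via the distributed
                // cache with a symlink of that name in the task's working directory.
                BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Look up each input record against the per-task table.
            String replacement = lookup.get(value.toString());
            if (replacement != null) {
                context.write(value, new Text(replacement));
            }
        }
    }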

Upvotes: 2
