Joe C

Reputation: 2827

How to know if a machine in a Spark cluster participates in a job

I wanted to know when it is safe to remove a machine (node) from a cluster.

My assumption is that it could be safe to remove a machine if it is not running any containers and does not store any data that is still needed.

Using the ResourceManager REST APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can do

 GET http://<rm http address:port>/ws/v1/cluster/nodes

to get information about each node, like

<node>
    <rack>/default-rack</rack>
    <state>RUNNING</state>
    <id>host1.domain.com:54158</id>
    <nodeHostName>host1.domain.com</nodeHostName>
    <nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
    <lastHealthUpdate>1476995346399</lastHealthUpdate>
    <version>3.0.0-SNAPSHOT</version>
    <healthReport></healthReport>
    <numContainers>0</numContainers>
    <usedMemoryMB>0</usedMemoryMB>
    <availMemoryMB>8192</availMemoryMB>
    <usedVirtualCores>0</usedVirtualCores>
    <availableVirtualCores>8</availableVirtualCores>
    <resourceUtilization>
        <nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
        <nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
        <nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
        <aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
        <aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
        <containersCPUUsage>0.0</containersCPUUsage>
    </resourceUtilization>
</node>

If numContainers is 0, I assume the node is not running any containers. However, can it still store data on disk that downstream tasks will read?
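For example, a quick scan for such nodes might look like this (a minimal sketch, assuming the Python requests library; the ResourceManager address is a placeholder):

# Minimal sketch: list YARN nodes that report zero running containers.
# Assumes the `requests` library; the RM address below is a placeholder.
import requests

RM = "http://resourcemanager.example.com:8088"  # hypothetical RM web address

# Ask for JSON instead of XML to keep parsing simple.
resp = requests.get(f"{RM}/ws/v1/cluster/nodes",
                    headers={"Accept": "application/json"})
resp.raise_for_status()

for node in resp.json()["nodes"]["node"]:
    if node["state"] == "RUNNING" and node["numContainers"] == 0:
        print(f"{node['nodeHostName']}: no containers running")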

I could not figure out whether Spark lets us know this. I assume that if a machine still stores data useful for the running job, it may maintain a heartbeat with the Spark driver or some central controller? Can we check this by scanning TCP or UDP connections?
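To illustrate that last idea, here is roughly what I have in mind (a minimal sketch, assuming the psutil library; the driver address and port are placeholders, and I am not sure an open connection reliably indicates participation):

# Minimal sketch: look for established TCP connections from this machine
# back to the Spark driver. Assumes `psutil`; host and port below are
# placeholders, and a connection is at best a weak hint of participation.
import psutil

DRIVER_HOST = "10.0.0.1"  # hypothetical Spark driver address
DRIVER_PORT = 7077        # placeholder; in practice the driver's port is often random

for conn in psutil.net_connections(kind="tcp"):
    if (conn.status == psutil.CONN_ESTABLISHED and conn.raddr
            and conn.raddr.ip == DRIVER_HOST and conn.raddr.port == DRIVER_PORT):
        print(f"pid {conn.pid} is connected to the driver")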

Is there any other way to check whether a machine in a Spark cluster participates in a job?

Upvotes: 1

Views: 182

Answers (1)

Tej

Reputation: 235

I am not sure whether you just want to know if a node is running any task (if that is what you mean by 'participate') or whether you want to know if it is safe to remove a node from the Spark cluster.

I will try to explain the latter point.

Spark has the ability to recover from failures, and this also applies to a node being removed from the cluster. The removed node can be running an executor or the application master.

  1. If the application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN provides a new application master. The number of retries is configured in:

yarn.resourcemanager.am.max-attempts

By default, this value is 2 (see the yarn-site.xml snippet after this list).

  2. If a node on which a task is running is removed, the resource manager (which is YARN here) will stop getting heartbeats from that node. The application master will know it is supposed to reschedule the failed task, as it will no longer receive progress updates from that node. It will then request resources from the resource manager and reschedule the task.
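For reference, the retry limit from point 1 lives in yarn-site.xml; a minimal sketch (the value shown is just an example, not a recommendation):

<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>
</property>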

As far as data on these nodes is concerned, you need to understand how tasks and their output are handled. Every node has its own local storage to hold the output of the tasks running on it. After a task completes successfully, the OutputCommitter moves its output from local storage to the job's shared storage (HDFS), from where the data is picked up for the next stage of the job. When a task fails (perhaps because the node running it failed or was removed), the task is rerun on another available node.
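To make the commit step concrete, here is a rough sketch of the pattern (not Hadoop's actual OutputCommitter API; just the write-to-temp-then-promote idea it implements, with illustrative paths):

# Rough sketch of the commit pattern used by output committers:
# a task writes to a private temp location, and only a successful
# task's output is promoted to the shared job output directory.
# Paths and names here are illustrative, not Hadoop's real API.
import os
import shutil

def run_task(task_id: int, data: str, job_output_dir: str) -> None:
    tmp_dir = os.path.join(job_output_dir, f"_temporary/task_{task_id}")
    os.makedirs(tmp_dir, exist_ok=True)

    # 1. Write output somewhere private to this task attempt.
    with open(os.path.join(tmp_dir, "part-0"), "w") as f:
        f.write(data)

    # 2. Commit: promote the output on success. If the node died before
    #    this point, the task is simply rerun on another node.
    shutil.move(os.path.join(tmp_dir, "part-0"),
                os.path.join(job_output_dir, f"part-{task_id:05d}"))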

In fact, the application master will also rerun tasks that had already completed successfully on this node, as their output, stored on the node's local storage, will no longer be available.

Upvotes: 1
