motte1988

Reputation: 31

yarn hadoop 2.4.0: info message: ipc.Client Retrying connect to server

I've searched for two days for a solution, but nothing has worked.

First, I'm new to the whole Hadoop/YARN/HDFS topic and want to configure a small cluster.

The message above doesn't show up every time I run an example from mapreduce-examples.jar. Sometimes teragen works, sometimes not. In some cases the whole job fails, in others it finishes successfully. Sometimes the job fails without printing the message above.

14/06/08 15:42:46 INFO ipc.Client: Retrying connect to server: FQDN-HOSTNAME/XXX.XX.XX.XXX:53022. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)

This message is printed 30 times, and the port (53022 in the example above) changes every time a job is started. If the job finishes successfully, this is printed:

14/06/08 15:34:20 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 running in uber mode : false
14/06/08 15:34:20 INFO mapreduce.Job:  map 100% reduce 100%
14/06/08 15:34:20 INFO mapreduce.Job: Job job_1402234146062_0002 completed successfully

If it fails, this is shown:

INFO mapreduce.Job: Job job_1402234146062_0005 failed with state FAILED due to: Task failed task_1402234146062_0005_m_000002
Job failed as tasks failed. failedMaps:1 failedReduces:0

In this case, some tasks failed, but there is no reason or error message to be found in the log files of the NodeManager, DataNode, ResourceManager, ...

INFO mapreduce.Job: Task Id : attempt_1402234146062_0006_m_000002_1, Status : FAILED

Additional information about my configuration:

  • OS: CentOS 6.5
  • Java version: OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13), OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

yarn-site.xml

<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.nodemanager.address</name>
                <value>FQDN-HOSTNAME:8050</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                  <name>yarn.nodemanager.localizer.address</name>
                  <value>FQDN-HOSTNAME:8040</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
                  <name>yarn.resourcemanager.resource-tracker.address</name>
                  <value>FQDN-HOSTNAME:8025</value>
        </property>
        <property>
                  <name>yarn.resourcemanager.scheduler.address</name>
                  <value>FQDN-HOSTNAME:8030</value>
        </property>
        <property>
                  <name>yarn.resourcemanager.address</name>
                  <value>FQDN-HOSTNAME:8032</value>
        </property>
</configuration>

hdfs-site.xml

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property>
                <name>dfs.permissions</name>
                <value>false</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///var/data/hadoop/hdfs/nn</value>
        </property>
        <property>
                <name>fs.checkpoint.dir</name>
                <value>file:///var/data/hadoop/hdfs/snn</value>
        </property>
        <property>
                <name>fs.checkpoint.edits.dir</name>
                <value>file:///var/data/hadoop/hdfs/snn</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///var/data/hadoop/hdfs/dn</value>
        </property>
</configuration>

mapred-site.xml

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.cluster.temp.dir</name>
                <value>/mapred/tempDir</value>
        </property>
        <property>
                <name>mapreduce.cluster.local.dir</name>
                <value>/mapred/localDir</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>FQDN-HOSTNAME:10020</value>
        </property>
</configuration>

I hope somebody can help me. :) Thank you, Norman

Upvotes: 3

Views: 4918

Answers (7)

SHIVAM SINGH

Reputation: 343

For anyone who thinks their firewall rules and yarn-site.xml are all in place, but start-yarn.sh is still not working:

Consider changing your OpenJDK version. It may save you several hours.

I had everything configured correctly, but I was using OpenJDK 17.0.10 and kept encountering the same error.

I checked the logs via my HDFS cluster web UI (http://YOUR_IP:9870), and there I found the problem with the ResourceManager and NodeManager. The solution is described below.

I updated my OpenJDK to this configuration:

OpenJDK Runtime Environment Temurin-11.0.20.1+1 (build 11.0.20.1+1)
OpenJDK 64-Bit Server VM Temurin-11.0.20.1+1 (build 11.0.20.1+1, mixed mode)

and it worked.
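
A minimal sketch of that switch, assuming a Java 11 build is already unpacked under /opt/jdk-11 (the path and install method will differ on your system):

# check which JDK the daemons are currently picking up
java -version

# in $HADOOP_HOME/etc/hadoop/hadoop-env.sh, point Hadoop at the Java 11 install
export JAVA_HOME=/opt/jdk-11

# restart YARN so the ResourceManager and NodeManager use the new JDK
stop-yarn.sh
start-yarn.sh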

Upvotes: 0

user1318024

Reputation: 396

Wow! Are these answers for real?? Talking about FQDNs when the job clearly completes...as long as the firewall is disabled?? And the OP even posted the detailed log messages and configuration.

The problem is that yarn.app.mapreduce.am.job.client.port-range is not being honored. I'm running into it also.

Firewall off...all is well (and I can see the ephemeral ports from the YARN job).

Firewall on...everything times out (eventually).

Horton completely ignores this question on other boards.

So here's log output from a job which demonstrates the problem. In the first case, I have the firewall enabled on the client(s) based on Hortonworks' docs (along with other ports I discovered by looking very closely at my installation). You will see the process timing out...and then all of a sudden working, because I disabled the firewall after watching the job output :)

2015-01-15 16:48:22,943 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: de-luster-l2723nraqsy5-ywhniidze3lb-qfk4asn77vc5/10.0.0.41:52015. Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
2015-01-15 16:48:23,349 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /hadoop/yarn/local/usercache/l.admin/appcache/application_1420482341308_0020
2015-01-15 16:48:24,122 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-01-15 16:48:24,656 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
2015-01-15 16:48:24,724 INFO [main] org.apache.hadoop.mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@7f94ee59
2015-01-15 16:48:24,792 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=534354336, maxSingleShuffleLimit=133588584, mergeThreshold=352673888, ioSortFactor=100, memToMemMergeOutputsThreshold=100

Did ya see it?? Problem with timeout...then all of a sudden Shuffle commences. Nothing to do with FQDNs after all :)

Upvotes: 0

If you see a message like

INFO ipc.Client: Retrying connect to server: <hostname>/<ip>:<port>. Already tried 1 time(s); maxRetries=3

You need to check:

  • the firewall between the client and the NodeManager
  • yarn.app.mapreduce.am.job.client.port-range; by default the range is all possible ports (see the sketch below)
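
If you do want to pin that range so the firewall can be opened for it, a minimal sketch for mapred-site.xml, with 50100-50200 as a purely illustrative choice:

        <property>
                <!-- restrict the ports the MR AppMaster binds for client RPC -->
                <name>yarn.app.mapreduce.am.job.client.port-range</name>
                <value>50100-50200</value>
        </property>

Then allow that same range through the firewall on the nodes that can run the AppMaster.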

Upvotes: 0

Frank N

Reputation: 11

This is a bug in how the MR AppMaster starts up with ephemeral ports. It exists in the Hadoop 2.6.0 release as well.
I have figured out a fix for this bug and created a JIRA on the MAPREDUCE project, along with a comment on how to fix it:

https://issues.apache.org/jira/browse/MAPREDUCE-6338

Upvotes: 1

ravik

Reputation: 76

Definitely a bug; this post provides clearer insight into what is happening: https://groups.google.com/a/cloudera.org/forum/#!msg/cdh-user/P1rfMQmYVWk/eARZXHUTkW0J

We are planning to get around this issue by reducing the ephemeral port range, thus limiting which ports are grabbed, and then configuring iptables to allow that port range; a rough sketch of this is below. Setting the port ranges is explained here - http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html
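
Something along these lines, where 50000-51000 is just an example range (run both steps on every node, and make the sysctl change permanent via /etc/sysctl.conf):

# shrink the range of ephemeral ports the kernel hands out
sysctl -w net.ipv4.ip_local_port_range="50000 51000"

# allow exactly that range through iptables
iptables -A INPUT -p tcp --dport 50000:51000 -j ACCEPT
service iptables save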

Upvotes: 0

Deleteman

Reputation: 8690

Another possible solution is to check the firewall on all the nodes. If you're dealing with iptables, you can run this on every node:

# /etc/init.d/iptables save
# /etc/init.d/iptables stop

That will stop the firewall until the next restart, but it should be enough for you to test the cluster. You don't have to restart YARN or anything; just run the job again.

If you want to disable the firewall permanently (across reboots):

# chkconfig iptables off

Upvotes: 0

naimdjon

Reputation: 3602

The job sometimes finishes successfully because, when you have only one reducer and that reduce task by chance lands on a working NodeManager, the job goes through.

You have to make sure that FQDN-HOSTNAME is written exactly the same way in the slaves file. If I remember correctly, my solution was to remove the loopback entry for the hostname in /etc/hosts, that is, to comment it out like this:

#127.0.0.1    FQDN-HOSTNAME
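
For reference, a hedged example of what /etc/hosts could look like afterwards, with 192.168.1.10 standing in for the node's real interface address:

127.0.0.1       localhost localhost.localdomain
#127.0.0.1      FQDN-HOSTNAME
192.168.1.10    FQDN-HOSTNAME

The point is that FQDN-HOSTNAME resolves to an address the other nodes (and the job client) can actually reach, not to the loopback interface.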

Upvotes: 1
