Dyin

Reputation: 5366

Hadoop 2.5.2 on Mesos 0.21.0 - Failed to fetch URIs for container

I'm trying to run a simple WordCount example on Mesos with Hadoop 2.5.2. I've successfully set up HDFS (there is actually a YARN setup behind this as well, and it is working fine). The Mesos master is running and has 4 slaves connected to it. The Hadoop library for Mesos is 0.0.8.

The configuration for Hadoop 2.5.2 is (mapred-site.xml):

<configuration>
        <property>
                <name>mapred.job.tracker</name>
                <value>*.*.*.*:9001</value>
        </property>
        <property>
                <name>mapred.job.tracker.http.address</name>
                <value>*.*.*.*:50030</value>
        </property>
        <property>
                <name>mapred.jobtracker.taskScheduler</name>
                <value>org.apache.hadoop.mapred.MesosScheduler</value>
        </property>
        <property>
                <name>mapred.mesos.taskScheduler</name>
                <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
        </property>
        <property>
                <name>mapred.mesos.master</name>
                <value>*.*.*.*:5050</value>
        </property>
        <property>
                <name>mapred.mesos.executor.uri</name>
                <value>hdfs://*.*.*.*:9000/hadoop-2.5.0-cdh5.2.0.tgz</value>
        </property>
</configuration>
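
For completeness, the JobTracker is launched roughly like this (a sketch; the jar and libmesos paths below are illustrative and depend on the install):

    # Sketch of the JobTracker launch; exact paths depend on the install
    export HADOOP_CLASSPATH=/path/to/hadoop-mesos-0.0.8.jar:$HADOOP_CLASSPATH
    export MESOS_NATIVE_JAVA_LIBRARY=/opt/mesos-0.21.0/build/src/.libs/libmesos.so
    hadoop jobtracker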

I've got the following logs from all my slaves (example):

dbpc42: I1202 00:03:12.066195 11232 launcher.cpp:137] Forked child with pid '18714' for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4'

dbpc42: I1202 00:03:12.068272 11232 containerizer.cpp:571] Fetching URIs for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4' using command '/opt/mesos-0.21.0/build/src/mesos-fetcher'

dbpc42: I1202 00:03:12.140894 11226 containerizer.cpp:946] Destroying container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4'

dbpc42: E1202 00:03:12.141315 11229 slave.cpp:2787] Container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4' for executor 'executor_Task_Tracker_93' of framework '20141201-225046-698725789-5050-19765-0003' failed to start: Failed to fetch URIs for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4': exit status 256

dbpc42: I1202 00:03:12.242033 11231 containerizer.cpp:1117] Executor for container 'c10c2d2b-bf4b-469b-97a2-60c9720773b4' has exited

dbpc42: I1202 00:03:12.243896 11225 slave.cpp:2898] Executor 'executor_Task_Tracker_93' of framework 20141201-225046-698725789-5050-19765-0003 exited with status 1

The JobTracker is running fine, but with the hadoop jar command the job gets stuck at map 0% reduce 0%. In the Mesos cluster information the TASKS_LOST counter keeps climbing until I kill the job. Mesos and the JobTracker run as root; the job runs as user hdfs.

What is this URI problem all about?

Thank you for your kind help or hint!

(I'll provide more information if needed.)

UPDATE

Starting a slave on the same PC where the master runs gets tasks into the STAGING status, 5 of them each time.

The mapred.mesos.executor.uri has been changed from the IP to dbpc41 (the master PC).

<property>
     <name>mapred.mesos.executor.uri</name>
     <value>hdfs://dbpc41:9000/hadoop-2.5.0-cdh5.2.0.tgz</value>
</property>

The other 4 slaves are still losing tasks, probably because they cannot fetch the executor URI; one way to verify that is to attempt the fetch by hand, as sketched below.
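
A manual fetch from a failing slave should reproduce the fetcher's behaviour (a sketch; the URI is the one from mapred.mesos.executor.uri, and the user should be the one the tasks run as):

    # Run on a failing slave, as user hdfs (the user the job runs as)
    hadoop fs -ls hdfs://dbpc41:9000/hadoop-2.5.0-cdh5.2.0.tgz
    hadoop fs -copyToLocal hdfs://dbpc41:9000/hadoop-2.5.0-cdh5.2.0.tgz /tmp/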

These are the logs from the 5th slave, the one running on the same PC as the master:

I1202 16:17:57.434345 1405 containerizer.cpp:571] Fetching URIs for container '5f33123b-00eb-4e05-9dcc-30f16f5eee44' using command '/opt/mesos-0.21.0/build/src/mesos-fetcher'
I1202 16:18:08.620708 1412 slave.cpp:2840] Monitoring executor 'executor_Task_Tracker_445' of framework '20141201-225046-698725789-5050-19765-0012' in container '5f33123b-00eb-4e05-9dcc-30f16f5eee44'
I1202 16:18:09.022902 1407 containerizer.cpp:1117] Executor for container '5f33123b-00eb-4e05-9dcc-30f16f5eee44' has exited
I1202 16:18:09.022964 1407 containerizer.cpp:946] Destroying container '5f33123b-00eb-4e05-9dcc-30f16f5eee44'
W1202 16:18:11.369912 1407 containerizer.cpp:888] Skipping resource statistic for container 5f33123b-00eb-4e05-9dcc-30f16f5eee44 because: Failed to get usage: No process found at 11093
W1202 16:18:11.369971 1407 containerizer.cpp:888] Skipping resource statistic for container 5f33123b-00eb-4e05-9dcc-30f16f5eee44 because: Failed to get usage: No process found at 11093
I1202 16:18:11.399648 1412 slave.cpp:2898] Executor 'executor_Task_Tracker_445' of framework 20141201-225046-698725789-5050-19765-0012 exited with status 1
I1202 16:18:11.401949 1412 slave.cpp:2215] Handling status update TASK_LOST (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012 from @0.0.0.0:0
W1202 16:18:11.402245 1409 containerizer.cpp:852] Ignoring update for unknown container: 5f33123b-00eb-4e05-9dcc-30f16f5eee44
I1202 16:18:11.403017 1410 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.403437 1406 slave.cpp:2458] Forwarding the update TASK_LOST (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012 to [email protected]:5050
I1202 16:18:11.448752 1409 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 959709c2-5546-41fd-9af3-09f024bb6354) for task Task_Tracker_445 of framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.449354 1408 slave.cpp:3007] Cleaning up executor 'executor_Task_Tracker_445' of framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.449707 1405 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012/executors/executor_Task_Tracker_445/runs/5f33123b-00eb-4e05-9dcc-30f16f5eee44' for gc 6.99999479755852days in the future
I1202 16:18:11.450034 1409 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012/executors/executor_Task_Tracker_445' for gc 6.9999947929037days in the future
I1202 16:18:11.450147 1408 slave.cpp:3084] Cleaning up framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.450213 1406 status_update_manager.cpp:279] Closing status update streams for framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:11.450381 1412 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012' for gc 6.99999478812444days in the future
I1202 16:18:12.441505 1405 slave.cpp:1083] Got assigned task Task_Tracker_472 for framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:12.442337 1405 gc.cpp:84] Unscheduling '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012' from gc
I1202 16:18:12.442617 1405 slave.cpp:1193] Launching task Task_Tracker_472 for framework 20141201-225046-698725789-5050-19765-0012
I1202 16:18:12.444263 1405 slave.cpp:3997] Launching executor executor_Task_Tracker_472 of framework 20141201-225046-698725789-5050-19765-0012 in work directory '/tmp/mesos/slaves/20141201-225046-698725789-5050-19765-S4/frameworks/20141201-225046-698725789-5050-19765-0012/executors/executor_Task_Tracker_472/runs/2310c642-02bf-401b-954c-876c88675c31'
I1202 16:18:12.444756 1405 slave.cpp:1316] Queuing task 'Task_Tracker_472' for executor executor_Task_Tracker_472 of framework '20141201-225046-698725789-5050-19765-0012
I1202 16:18:12.444793 1406 containerizer.cpp:424] Starting container '2310c642-02bf-401b-954c-876c88675c31' for executor 'executor_Task_Tracker_472' of framework '20141201-225046-698725789-5050-19765-0012'
I1202 16:18:12.447434 1406 launcher.cpp:137] Forked child with pid '11549' for container '2310c642-02bf-401b-954c-876c88675c31'
I1202 16:18:12.448652 1406 containerizer.cpp:571] Fetching URIs for container '2310c642-02bf-401b-954c-876c88675c31' using command '/opt/mesos-0.21.0/build/src/mesos-fetcher'

Upvotes: 1

Views: 1940

Answers (2)

Dyin

Reputation: 5366

I checked the executor logs (stderr in /tmp/mesos/slaves/...) and found out that JAVA_HOME was not set, so the hadoop dfs command used to fetch the executor could not run. The URI was perfect; it was JAVA_HOME that was missing. Additionally, I had to set HADOOP_HOME when starting the slaves.
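
In other words, the environment the slaves are started from has to carry both variables. A minimal sketch (the paths are illustrative, and the mesos-slave location is assumed from the build directory seen in the logs):

    # Sketch: environment for starting a slave; adjust paths to your install
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    export HADOOP_HOME=/opt/hadoop-2.5.0-cdh5.2.0
    /opt/mesos-0.21.0/build/src/mesos-slave --master=dbpc41:5050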

Upvotes: 2

Adam

Reputation: 4322

Looks like the Mesos slave cannot fetch one of the URIs, likely the executor itself.

Did you upload your modified Hadoop-on-Mesos distribution (including the hadoop-mesos-0.0.8.jar) to hdfs://*.*.*.*:9000/hadoop-2.5.0-cdh5.2.0.tgz, as specified by mapred.mesos.executor.uri? Is it accessible from the slaves?
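
If not, uploading and verifying it looks roughly like this (a sketch; adjust the local tarball path to wherever you built it):

    # Upload the repackaged distribution to the path the executor URI points at
    hadoop fs -put hadoop-2.5.0-cdh5.2.0.tgz hdfs://*.*.*.*:9000/hadoop-2.5.0-cdh5.2.0.tgz
    # Then check from a slave that it is readable
    hadoop fs -ls hdfs://*.*.*.*:9000/hadoop-2.5.0-cdh5.2.0.tgz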

Upvotes: 1
