Reputation: 1149
Apologies in advance for being a newbie and perhaps asking stupid questions. I have installed Hadoop as a single-machine cluster (Ubuntu 14.04) and successfully tested the very basic program specified in the Apache installation guide. Subsequently I installed R, RStudio, and the packages rhdfs and rmr2 with all their dependencies.
Then I tried to run the following program:
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar")
library('rhdfs')
library('rmr2')
hdfs.init()
small.ints = to.dfs(1:10)
mapreduce(
  input = small.ints,
  map = function(k, v) {
    lapply(seq_along(v), function(r) {
      x <- runif(v[[r]])
      keyval(r, c(max(x), min(x)))
    })
  })
The job fails, and the console output is as follows:
packageJobJar: [/tmp/RtmprPBBS1/rmr-local-env242520fb4125, /tmp/RtmprPBBS1/rmr-global-env24252518202b, /tmp/RtmprPBBS1/rmr-streaming-map24255b97931e, /tmp/hadoop-hduser/hadoop-unjar4430970496737933525/] [] /tmp/streamjob6651310557292596411.jar tmpDir=null
14/05/05 09:16:08 INFO mapred.FileInputFormat: Total input paths to process : 1
14/05/05 09:16:08 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hduser/mapred/local]
14/05/05 09:16:08 INFO streaming.StreamJob: Running job: job_201405050557_0013
14/05/05 09:16:08 INFO streaming.StreamJob: To kill this job, run:
14/05/05 09:16:08 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201405050557_0013
14/05/05 09:16:08 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201405050557_0013
14/05/05 09:16:09 INFO streaming.StreamJob: map 0% reduce 0%
14/05/05 09:16:41 INFO streaming.StreamJob: map 100% reduce 100%
14/05/05 09:16:41 INFO streaming.StreamJob: To kill this job, run:
14/05/05 09:16:41 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201405050557_0013
14/05/05 09:16:41 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201405050557_0013
14/05/05 09:16:41 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201405050557_0013_m_000001
14/05/05 09:16:41 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
The stderr log is as follows:
Error in library(functional) : there is no package called ‘functional’
No traceback available
Error during wrapup:
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I have tried a few other simple demo programs and the result is the same, so it seems that the problem lies with my configuration.
The 'functional' package was already installed and was being loaded automatically; even loading it manually does not help, so that is most probably not the problem.
Any help or suggestions would be gratefully accepted.
I am running Hadoop 1.2.1, R 3.0.5, and RStudio 0.98.507 on Ubuntu 14.04 in single-node cluster mode. Java is Oracle Java 7, version 1.7.0_55.
The Hadoop installation seems to be OK, since my regular WordCount program works fine.
I am getting identical results with even the simplest RHadoop demo.
Could this be a problem with my machine's capacity? I am running on a laptop with 2.8 GiB of memory and an Intel® Core™ i3-2310M CPU @ 2.10GHz × 4.
I have now moved to Hadoop 2.2.0 and managed to install it using this tutorial. The demo program for calculating pi executed without errors.
Then I executed this very simple MapReduce program:
Sys.setenv(HADOOP_CMD="/usr/local/hadoop220/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop220/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")
library('rhdfs')
library('rmr2')
library('functional')
hdfs.init()
small.ints = to.dfs(1:10)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
The program executed up to line 7 (the to.dfs() call) but failed at the all-important mapreduce() step, with the following error (only the last part of the error is shown):
14/05/06 13:53:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/05/06 13:53:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/05/06 13:53:37 INFO mapred.FileInputFormat: Total input paths to process : 1
14/05/06 13:53:37 INFO mapreduce.JobSubmitter: number of splits:2
14/05/06 13:53:37 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/05/06 13:53:37 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/05/06 13:53:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1399363749415_0002
14/05/06 13:53:38 INFO impl.YarnClientImpl: Submitted application application_1399363749415_0002 to ResourceManager at /0.0.0.0:8032
14/05/06 13:53:38 INFO mapreduce.Job: The url to track the job: http://yantrajaal:8088/proxy/application_1399363749415_0002/
14/05/06 13:53:38 INFO mapreduce.Job: Running job: job_1399363749415_0002
14/05/06 13:53:45 INFO mapreduce.Job: Job job_1399363749415_0002 running in uber mode : false
14/05/06 13:53:45 INFO mapreduce.Job: map 0% reduce 0%
14/05/06 13:53:57 INFO mapreduce.Job: map 100% reduce 0%
14/05/06 13:53:57 INFO mapreduce.Job: Task Id : attempt_1399363749415_0002_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
[... similar errors from the remaining task attempts omitted ...]
14/05/06 13:54:31 INFO mapreduce.Job: map 100% reduce 0%
14/05/06 13:54:32 INFO mapreduce.Job: Job job_1399363749415_0002 failed with state FAILED due to: Task failed task_1399363749415_0002_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/05/06 13:54:32 INFO mapreduce.Job: Counters: 10
Job Counters
Failed map tasks=7
Killed map tasks=1
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=72476
Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
14/05/06 13:54:32 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I am really at my wits' end about what to do next!
Any suggestions on the way forward would be gratefully received and acknowledged. My suspicion is that RHadoop is perhaps not yet comfortable with Ubuntu 14.04, but that is just a guess.
Upvotes: 2
Views: 2435
Reputation: 1638
Open your terminal and log in as the superuser (root) using
sudo su root
then start R in the terminal and install the RHadoop dependencies using the following commands:
install.packages(c("codetools", "R", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava")) install.packages(c("dplyr","R.methodsS3")) install.packages(c("Hmisc")) install.packages(c("caTools")) Sys.setenv(HADOOP_HOME="/usr/local/hadoop") Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoopversiomentionhere.jar")
After that, download the rmr2 and rhdfs source packages from here, and install each downloaded source file using this command:
install.packages(path_to_file, repos = NULL, type="source")
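For example, assuming the tarballs were downloaded to the current directory (the file names and version numbers here are illustrative):
install.packages("rmr2_3.1.0.tar.gz", repos = NULL, type = "source")
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")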
After installing, quit R and close the terminal, then open RStudio and run your R code again; the streaming error will be solved, because the steps above install the R libraries in the global (system-wide) folders.
Optionally, to be on the safer side, you can install R itself as the superuser. Hope this helps.
Upvotes: 4
Reputation: 1149
The problem is that when you install packages as a non-root user, they end up in a private, per-user directory; this is the cause of all the trouble. The solution is to log in as root (or a superuser) and install the packages so that they end up in the system-wide R library, which in my case is /usr/lib64/R/library. After this there are no more problems; the programs work!
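A minimal sketch of the fix (the library path shown is from my machine, and "functional" stands in for whichever package is missing; adjust both to your setup):
sudo R
# in the root R session, install.packages() defaults to the system-wide library
install.packages("functional")
find.package("functional")   # should now return something like "/usr/lib64/R/library/functional"
q()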
Upvotes: 0
Reputation: 366
I solved a problem similar to yours with the method below.
Have a look at your R library paths:
.libPaths()
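On a typical Ubuntu setup this returns something like the following (illustrative; the exact entries depend on your R version and user name):
# [1] "/home/hduser/R/x86_64-pc-linux-gnu-library/3.0"   (personal library)
# [2] "/usr/lib/R/site-library"
# [3] "/usr/lib/R/library"                               (system-wide library)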
Then check which library the functional package was installed into with the command below:
system.file(package="functional")
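For example (again illustrative), a result under /home/<user>/R/... points to a personal library, while a result under /usr/lib/R/site-library or /usr/lib/R/library points to a library common to all users.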
If it is installed in a personal library, instead of in a library common to all users, jobs will fail with an error saying the package cannot be loaded.
Hope this will help.
Cheers
Yanchang Zhao
Upvotes: 1
Reputation: 18580
It seems to be an error with the R setup on your single-machine cluster.
Is the R package functional installed on the cluster?
Upvotes: 1