LonelySoul

Reputation: 1232

Hadoop streaming fails in R

I am running the RHadoop sample script to test the system, using the following commands:

library(rmr2)
library(rhdfs)
Sys.setenv(HADOOP_HOME="/usr/bin/hadoop")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-mapreduce/hadoop-streaming.jar")
hdfs.init()
ints = to.dfs(1:100)
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))

But it fails with the error below.

>Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1587)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1611)
13/08/21 18:30:25 INFO mapred.JobClient: Job complete: job_201308191923_0307
13/08/21 18:30:25 INFO mapred.JobClient: Counters: 7
13/08/21 18:30:25 INFO mapred.JobClient:   Job Counters
13/08/21 18:30:25 INFO mapred.JobClient:     Failed map tasks=1
13/08/21 18:30:25 INFO mapred.JobClient:     Launched map tasks=8
13/08/21 18:30:25 INFO mapred.JobClient:     Data-local map tasks=8
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=46647
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/08/21 18:30:25 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/08/21 18:30:25 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, in.folder = if (is.list(input)) { :
  hadoop streaming failed with error code 1

Any leads on what may be wrong here?

Upvotes: 1

Views: 1820

Answers (1)

piccolbo

Reputation: 1315

HADOOP_HOME should be a directory and HADOOP_CMD should be a program, so setting both to the same path is wrong right there. HADOOP_CMD should supersede HADOOP_HOME, though, so that shouldn't be the root cause. The only option left is debugging. If you had read the debugging guide, you would have dug out stderr and would already know a lot more. With only the console output, there's nothing to work on.
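To make the directory-versus-program point concrete, here is a minimal sketch of a sanity check you could run before submitting a job. `check_hadoop_env` is a hypothetical helper written for this answer, not part of rmr2 or rhdfs, and it only checks that the paths look plausible; it does not replace reading stderr from the failed task.

```r
# Hypothetical helper (not part of rmr2 or rhdfs): list obviously wrong
# settings among the three environment variables used in the question.
check_hadoop_env <- function(env = Sys.getenv(c("HADOOP_HOME",
                                                "HADOOP_CMD",
                                                "HADOOP_STREAMING"))) {
  problems <- character(0)
  home <- env[["HADOOP_HOME"]]
  cmd  <- env[["HADOOP_CMD"]]
  jar  <- env[["HADOOP_STREAMING"]]

  # HADOOP_HOME, when set, should be a directory.
  if (nzchar(home) && !dir.exists(home))
    problems <- c(problems, "HADOOP_HOME is not a directory")

  # HADOOP_CMD should be the hadoop program: an existing file, not a directory.
  if (!nzchar(cmd) || !file.exists(cmd) || dir.exists(cmd))
    problems <- c(problems, "HADOOP_CMD is not a program file")

  # HADOOP_STREAMING should be the streaming jar: an existing file.
  if (!nzchar(jar) || !file.exists(jar) || dir.exists(jar))
    problems <- c(problems, "HADOOP_STREAMING is not an existing file")

  # A directory and a program cannot share one path.
  if (nzchar(home) && identical(home, cmd))
    problems <- c(problems, "HADOOP_HOME and HADOOP_CMD are set to the same path")

  problems
}
```

Calling `check_hadoop_env()` after the `Sys.setenv` lines returns a character vector of problems, or an empty vector if nothing looks off. An empty result only means the paths exist and have the right kind; the actual cause of the failure still has to come from the task's stderr.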

Upvotes: 2
