Reputation: 65
I'm running a Cloudera cluster in 3 virtual machines and trying to execute an HBase bulk load via a MapReduce job. But I always get the error:
error: Class org.apache.hadoop.hbase.mapreduce.HFileOutputFormat not found
So it seems that the map tasks don't find the class. I tried the following:
1) adding the hbase.jar to the HADOOP_CLASSPATH on every node
2) adding TableMapReduceUtil.addDependencyJars(job) / TableMapReduceUtil.addDependencyJars(myConf, HFileOutputFormat.class) to my source code
Nothing worked. I have absolutely no idea why the class is not found, because the jar/class is definitely available in the classpath.
If I take a look into the job.xml I see the following entry:
name=tmpjars value=file:/C:/Users/Thomas/.m2/repository/org/apache/zookeeper/zookeeper/3.4.5-cdh4.3.0/zookeeper-3.4.5-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hbase/hbase/0.94.6-cdh4.3.0/hbase-0.94.6-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.3.0/hadoop-core-2.0.0-mr1-cdh4.3.0.jar,file:/C:/Users/Thomas/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar,file:/C:/Users/Thomas/.m2/repository/com/google/protobuf/protobuf-java/2.4.0a/protobuf-java-2.4.0a.jar
This seems a little odd to me: these are my local jars on the Windows system. Maybe these should be the HDFS jars? If yes, how can I change the values for "tmpjars"?
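I'm not sure whether tmpjars is meant to hold local or HDFS paths, but one thing I could try in the driver code below is setting it explicitly to jars that already live in HDFS (untested sketch; the HDFS paths are just placeholders):

// untested sketch: point tmpjars at jars that already sit in HDFS
// instead of the local Maven repository (placeholder paths)
configuration.set("tmpjars",
    "hdfs://192.168.2.41:8020/user/thomas/lib/hbase-0.94.6-cdh4.3.0.jar,"
  + "hdfs://192.168.2.41:8020/user/thomas/lib/zookeeper-3.4.5-cdh4.3.0.jar");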
Here is the Java code I try to execute:
// tablename and hTable are defined elsewhere in my code
configuration = new Configuration(false);
configuration.set("mapred.job.tracker", "192.168.2.41:8021");
configuration.set("fs.defaultFS", "hdfs://192.168.2.41:8020/");
configuration.set("hbase.zookeeper.quorum", "192.168.2.41");
configuration.set("hbase.zookeeper.property.clientPort", "2181");

Job job = new Job(configuration, "HBase Bulk Import for " + tablename);
job.setJarByClass(HBaseKVMapper.class);

job.setMapperClass(HBaseKVMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);

job.setOutputFormatClass(HFileOutputFormat.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setInputFormatClass(TextInputFormat.class);

// configures reducer, partitioner and output key/value classes
// for HFile output against the target table
HFileOutputFormat.configureIncrementalLoad(job, hTable);

FileInputFormat.addInputPath(job, new Path("myfile1"));
FileOutputFormat.setOutputPath(job, new Path("myfile2"));

job.waitForCompletion(true);

// move the generated HFiles into the live table
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(configuration);
loader.doBulkLoad(new Path("myFile3"), hTable);
EDIT:
I tried a little bit more and it's totally strange. I added the following line to the Java code:
job.setJarByClass(HFileOutputFormat.class);
After executing this, the error is gone, but another ClassNotFoundException appears:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class mypackage.bulkLoad.HBaseKVMapper not found
HBaseKVMapper is my custom mapper class I want to execute. So I tried to add it with "job.setJarByClass(HBaseKVMapper.class)", but that doesn't work since it's only a class file and not a jar. So I generated a jar file including HBaseKVMapper.class. After that, I executed the job again and now got the HFileOutputFormat.class not found exception again.
After debugging a little bit, I found out that the setJarByClass method only copies the local jar file to .staging/job_#number/job.jar on HDFS. So setJarByClass() will only work for one jar file, because it overwrites job.jar when setJarByClass() is executed again with another jar.
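If that is the case, maybe I could keep setJarByClass() for my own jar and ship the HBase jar separately through the distributed cache. Just a sketch; the HDFS path is a placeholder and assumes the jar was copied to HDFS beforehand:

// sketch: keep job.jar for the jar containing my own mapper ...
job.setJarByClass(HBaseKVMapper.class);
// ... and ship the HBase jar via the distributed cache classpath
// (placeholder path; assumes the jar was uploaded to HDFS first)
DistributedCache.addFileToClassPath(
    new Path("/user/thomas/lib/hbase-0.94.6-cdh4.3.0.jar"),
    job.getConfiguration());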
While searching for the error I saw the following structure in the job staging directory:
and inside the libjars directory I saw the relevant jar files.
So the HBase jar is inside the libjars directory, but the jobtracker doesn't use it for executing the job. Why?
Upvotes: 2
Views: 6326
Reputation: 287
I would try using Cloudera Manager (free version) as it takes care of these issues for you. Otherwise note the following:
Both your own classes and the HBase class HFileOutputFormat need to be available on the classpath locally and remotely.
Submitting the job
Meaning getting the classpath right locally for when your driver runs:
$ env HADOOP_CLASSPATH=$(hbase classpath) hadoop jar path/to/jar class....
On the server
In your hadoop-env.sh
export HADOOP_CLASSPATH=$(hbase classpath)
or use
TableMapReduceUtil.addDependencyJars
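For example, in your driver (a sketch; job is the Job instance from the question):

// ships the jars containing the job's dependencies (HBase, ZooKeeper, ...)
// with the job so the task JVMs can resolve them
TableMapReduceUtil.addDependencyJars(job);
// or explicitly for particular classes:
TableMapReduceUtil.addDependencyJars(job.getConfiguration(),
    HFileOutputFormat.class, HBaseKVMapper.class);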
Upvotes: 1
Reputation: 65
I found a "hacked" solution which worked for me, but I'm not happy with it because it's not really practical.
My "hacked" solution:
When I execute this, the big.jar is copied to the filesystem for every job. No errors anymore. The problem is that the jar is 80 MB in size and has to be copied every time.
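Roughly, the idea is to bundle my classes and all HBase/ZooKeeper dependencies into one jar and hand that jar to the job directly. Only a sketch of the idea; the path is a placeholder:

// sketch: submit one fat jar that bundles my classes plus all dependencies
// (placeholder path; "mapred.jar" is the classic MR1 job-jar property)
configuration.set("mapred.jar", "/path/to/big.jar");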
If anyone knows a better way, I would be thankful if they could tell me how.
EDIT:
Now I try to execute jobs with Apache Pig and have exactly the same problem. My hacked solution doesn't work in this case because Pig creates the jobs automatically. Here is the Pig error:
java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.mapreduce.TableSplit not found
Upvotes: 0