Reputation: 305
I have code that reads an HBase table, formats the rows, and then converts them to a DataFrame:
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.HTable
val tableName = "my_table"
val conf = HBaseConfiguration.create()
// Add local HBase conf
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
admin.isTableAvailable(tableName)
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
case class MyClass(srcid: Long, srcLat: Double, srcLong: Double, dstid: Long, dstLat: Double, dstLong: Double, time: Int, duration: Integer )
val parsed = hBaseRDD.map { case (_, result) =>
  // Cells are returned in sorted (family, qualifier) order,
  // so each iter.next() corresponds to a fixed column
  val iter = result.list().iterator()
  (Bytes.toString(result.getRow()).toLong,
   Bytes.toString(iter.next().getValue()).toDouble,
   Bytes.toString(iter.next().getValue()).toDouble,
   Bytes.toString(iter.next().getValue()).toLong,
   Bytes.toString(iter.next().getValue()).toDouble,
   Bytes.toString(iter.next().getValue()).toDouble,
   Bytes.toString(iter.next().getValue()).toInt,
   Bytes.toString(iter.next().getValue()))
}.map { s =>
  // Strip the "T" separator and the "+03:00" zone suffix
  // so SimpleDateFormat can parse the timestamp
  val time = s._8.replaceAll("T", "").replaceAll("\\+03:00", "")
  val format = new java.text.SimpleDateFormat("yyyy-MM-ddHH:mm:ss.SSS")
  val date = format.parse(time)
  MyClass(s._1, s._5, s._6, s._4, s._2, s._3, date.getHours(), s._7)
}.toDF()
parsed.registerTempTable("my_table")
This code works nicely in spark-shell. However, I want to use it inside a Zeppelin notebook, and I was expecting it to work in a paragraph just as well. When I run the code, it fails on the import statements with the following error:
<console>:28: error: object hbase is not a member of package org.apache.hadoop
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
Do I need to add a dependency to use HBase with Spark in Zeppelin? If so, how can I do it?
Upvotes: 1
Views: 2297
Reputation: 16076
Add a dependency on HBase as described in the documentation: http://zeppelin.apache.org/docs/0.6.0/manual/dependencymanagement.html
You will need org.apache.hbase:hbase:1.2.3
Also, you may be interested in the Zeppelin HBase interpreter, which lets you run HBase queries directly from Zeppelin. That is out of scope for this question, though.
Upvotes: 1