Reputation: 415
I am trying to write to an HBase table using PySpark. So far, I have been able to read data from HBase, but I am getting an exception when writing to the HBase table.
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.types import *
properties = {
    "instanceId": "hbase",
    "zookeepers": "10-x-x-x.local:2181,10-x-x-x.local:2181,10-x-x-x.local:2181",
    "hbase.columns.mapping": "KEY_FIELD STRING :key, A STRING c:a, B STRING c:b",
    "hbase.use.hbase.context": False,
    "hbase.config.resources": "file:///etc/hbase/conf/hbase-site.xml",
    "hbase.table": "t"
}
spark = SparkSession\
    .builder\
    .appName("hbaseWrite")\
    .getOrCreate()
sc = spark.sparkContext
#I am able to read the data successfully.
#df = spark.read.format("org.apache.hadoop.hbase.spark")\
# .options( **properties)\
# .load()
data = [("3","DATA 3 A", "DATA 3 B")]
columns = ['KEY_FIELD','A','B']
cSchema = StructType([StructField(columnName, StringType()) for columnName in columns])
df = spark.createDataFrame(data, schema=cSchema)
df.write\
    .options(**properties)\
    .mode('overwrite')\
    .format("org.apache.hadoop.hbase.spark")\
    .save()
I am executing the command in the following format:
spark2-submit --master local[*] write_to_hbase.py
Spark version: 2.2.0.cloudera1 (I can't change my Spark version)
HBase version: 1.2.0-cdh5.12.0 (but I can change my HBase version)
Note: I have added the HBase jars and the following dependent jars to the spark2 jars folder.
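(If copying jars around becomes unwieldy, the same dependent jars could presumably also be supplied at submit time with --jars instead of placing them in the jars folder; the paths below are only placeholders.)
spark2-submit --master local[*] \
    --jars /path/to/hbase-client.jar,/path/to/hbase-common.jar,/path/to/hbase-spark.jar \
    write_to_hbase.py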
Error :
py4j.protocol.Py4JJavaError: An error occurred while calling o70.save.
: java.lang.RuntimeException: org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:476)
I've tried multiple suggestions, but nothing has worked. This might be a duplicate question, but I have no other way to find an answer.
Upvotes: 0
Views: 2844
Reputation: 415
I solved it by compiling the git repo https://github.com/hortonworks-spark/shc, putting the shc JAR in the Spark jars folder, and following the link suggested by @Aniket Kulkarni.
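For reference, building and deploying the connector looked roughly like this (a sketch; the built jar name and the spark2 jars path vary by SHC version and CDH layout, so treat them as placeholders):
git clone https://github.com/hortonworks-spark/shc.git
cd shc
mvn clean package -DskipTests
# copy the built shc-core jar into the spark2 jars folder (path is environment-specific)
cp core/target/shc-core-*.jar /opt/cloudera/parcels/SPARK2/lib/spark2/jars/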
The final code looks something like this:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.types import *
properties = {
    "instanceId": "hbase",
    "zookeepers": "10-x-x-x.local:2181,10-x-x-x.local:2181,10-x-x-x.local:2181",
    "hbase.columns.mapping": "KEY_FIELD STRING :key, A STRING c:a, B STRING c:b",
    "hbase.use.hbase.context": False,
    "hbase.config.resources": "file:///etc/hbase/conf/hbase-site.xml",
    "hbase.table": "test_table"
}
spark = SparkSession.builder\
    .appName("hbaseWrite")\
    .getOrCreate()
sc = spark.sparkContext
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"test_table"},
    "rowkey":"key",
    "columns":{
        "KEY_FIELD":{"cf":"rowkey", "col":"key", "type":"string"},
        "A":{"cf":"c", "col":"a", "type":"string"},
        "B":{"cf":"c", "col":"b", "type":"string"}
    }
}""".split())
data = [("3","DATA 3 A", "DATA 3 B")]
columns = ['KEY_FIELD','A','B']
cSchema = StructType([StructField(columnName, StringType()) for columnName in columns])
df = spark.createDataFrame(data, schema=cSchema)
df.write\
    .options(catalog=catalog)\
    .options(**properties)\
    .mode('overwrite')\
    .format("org.apache.spark.sql.execution.datasources.hbase")\
    .save()
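To sanity-check the write, reading the row back through the same catalog should work (a minimal sketch, assuming the same SHC setup as above):
# read the written row back via the same catalog
df_read = spark.read\
    .options(catalog=catalog)\
    .format("org.apache.spark.sql.execution.datasources.hbase")\
    .load()
df_read.filter(df_read.KEY_FIELD == "3").show()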
Upvotes: 0
Reputation: 5480
If you are using the Cloudera distribution, then hard luck: there is no official way to write to HBase using PySpark. This has been confirmed by the Cloudera support team.
But if you are using Hortonworks and you have Spark 2.0, then the below link should get you started.
Upvotes: 2