Reputation: 805
I'm using HiveContext with SparkSQL and I'm trying to connect to a remote Hive metastore. The only way I've found to set the metastore is by including hive-site.xml on the classpath (or copying it to /etc/spark/conf/).
Is there a way to set this parameter programmatically in Java code without including hive-site.xml? If so, what is the Spark configuration to use?
Upvotes: 30
Views: 109699
Reputation: 805
For Spark 1.x, you can set it with:
System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
final SparkConf conf = new SparkConf();
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
Or
final SparkConf conf = new SparkConf();
SparkContext sc = new SparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
hiveContext.setConf("hive.metastore.uris", "thrift://METASTORE:9083");
Update: if your Hive is Kerberized, try setting these before creating the HiveContext:
System.setProperty("hive.metastore.sasl.enabled", "true");
System.setProperty("hive.security.authorization.enabled", "false");
System.setProperty("hive.metastore.kerberos.principal", hivePrincipal);
System.setProperty("hive.metastore.execute.setugi", "true");
Upvotes: 33
Reputation: 101
I am getting the following error with Spark 2.4.8, Spark 3.1.3, or Spark 3.2.2. The Hadoop version is 3.2, HBase is 2.4.14, Hive is 3.1.13, and Scala is 2.12.
Exception in thread "main" java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
I am calling spark-submit as follows.
export HBASE_JAR_FILES="/usr/local/hbase/lib/hbase-unsafe-4.1.1.jar,/usr/local/hbase/lib/hbase-common-2.4.14.jar,/usr/local/hbase/lib/hbase-client-2.4.14.jar,/usr/local/hbase/lib/hbase-protocol-2.4.14.jar,/usr/local/hbase/lib/guava-11.0.2.jar,/usr/local/hbase/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar"
/opt/spark/bin/spark-submit --master local[*] --deploy-mode client --num-executors 1 --executor-cores 1 --executor-memory 480m --driver-memory 512m --driver-class-path $(echo $HBASE_JAR_FILES | tr ',' ':') --jars "$HBASE_JAR_FILES" --files /usr/local/hive/conf/hive-site.xml --conf "spark.hadoop.metastore.catalog.default=hive" --files /usr/local/hbase/conf/hbase-site.xml --class com.hbase.dynamodb.migration.HbaseToDynamoDbSparkMain --conf "spark.driver.maxResultSize=256m" /home/hadoop/scala-2.12/sbt-1.0/HbaseToDynamoDb-assembly-0.1.0-SNAPSHOT.jar
The code is as follows.
val spark: SparkSession = SparkSession.builder()
.master("local[*]")
.appName("Hbase To DynamoDb migration demo")
.config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
.config("hive.metastore.uris","thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate()
spark.catalog.listDatabases().show()
val sqlDF = spark.sql("select rowkey, office_address, office_phone, name, personal_phone from hvcontacts")
sqlDF.show()
The Hive external table was created on top of HBase as follows.
create external table if not exists hvcontacts (rowkey STRING, office_address STRING, office_phone STRING, name STRING, personal_phone STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,Office:Address,Office:Phone,Personal:name,Personal:Phone') TBLPROPERTIES ('hbase.table.name' = 'Contacts');
The metastore is in MySQL and I can query the TBLS table to verify the external table in Hive. Is anyone else facing a similar issue?
NOTE: I am not using the Hive-Spark connector here.
Upvotes: 0
Reputation: 25189
For Spark 3.x:
// Scala
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path")
.enableHiveSupport()
.getOrCreate()
# Python
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.config("spark.sql.warehouse.dir", "hive_warehouse_hdfs_path") \
.enableHiveSupport() \
.getOrCreate()
Check databases available with:
spark.catalog.listDatabases().show()
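If the metastore is remote, you would additionally point Spark at it, e.g. in the Scala version (host and port are placeholders):
.config("hive.metastore.uris", "thrift://METASTORE:9083")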
Upvotes: 4
Reputation: 5487
I observed one strange behavior while trying to connect to the Hive metastore from Spark without using hive-site.xml.
Everything works fine when we set the hive.metastore.uris property within Spark code while creating the SparkSession.
But if we don't specify it in code and instead pass it to spark-shell or spark-submit with the --conf flag, it will not work. It throws the warning shown below and does not connect to the remote metastore.
Warning: Ignoring non-Spark config property: hive.metastore.uris
One workaround for this is to use the property below, which Spark forwards to the Hadoop/Hive configuration:
spark.hadoop.hive.metastore.uris
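For example, a minimal spark-submit invocation using the prefixed property (the metastore host, class, and jar names are placeholders):
spark-submit \
  --conf "spark.hadoop.hive.metastore.uris=thrift://METASTORE:9083" \
  --class com.example.MyApp \
  my-app.jar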
Upvotes: 6
Reputation: 11
In Hadoop 3, the Spark and Hive catalogs are separated, so:
For spark-shell / pyspark (which come with .enableHiveSupport() by default) just try:
pyspark --conf spark.hadoop.metastore.catalog.default=hive
For a spark-submit job, create your Spark session like this:
SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
then add this conf to your spark-submit command:
--conf spark.hadoop.metastore.catalog.default=hive
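For example (class and jar names are placeholders):
spark-submit --conf spark.hadoop.metastore.catalog.default=hive --class com.example.MyApp my-app.jar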
But for ORC tables (and more generally managed/internal tables) it is recommended to use the Hive Warehouse Connector.
Upvotes: 1
Reputation: 1194
Some of the similar questions are marked as duplicates; this is about connecting to Hive from Spark without using hive.metastore.uris or a separate thrift server (9083), and without copying hive-site.xml to SPARK_CONF_DIR.
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("hive-check")
.config(
"spark.hadoop.javax.jdo.option.ConnectionURL",
"JDBC_CONNECT_STRING"
)
.config(
"spark.hadoop.javax.jdo.option.ConnectionDriverName",
"org.postgresql.Driver"
)
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.config("spark.hadoop.javax.jdo.option.ConnectionUserName", "JDBC_USER")
.config("spark.hadoop.javax.jdo.option.ConnectionPassword", "JDBC_PASSWORD")
.enableHiveSupport()
.getOrCreate()
spark.catalog.listDatabases.show(false)
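Note that this approach talks to the metastore database directly over JDBC, so the JDBC driver must be on the classpath; one way is --packages with the standard PostgreSQL driver coordinates (the version here is an assumption, and the class and jar names are placeholders):
spark-submit --packages org.postgresql:postgresql:42.6.0 --class com.example.HiveCheck hive-check.jar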
Upvotes: 6
Reputation: 93
Spark Version : 2.0.2
Hive Version : 1.2.1
The Java code below worked for me to connect to the Hive metastore from Spark:
import org.apache.spark.sql.SparkSession;
public class SparkHiveTest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("Java Spark Hive Example")
.config("spark.master", "local")
.config("hive.metastore.uris",
"thrift://abc123.com:9083")
.config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
.enableHiveSupport()
.getOrCreate();
spark.sql("SELECT * FROM default.survey_data limit 5").show();
}
}
Upvotes: 4
Reputation: 74
The code below worked for me. We can ignore the hive.metastore.uris config for a local metastore; Spark will create Hive objects in a local spark-warehouse directory.
import org.apache.spark.sql.SparkSession

object spark_hive_support1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("yarn")
      .appName("Test Hive Support")
      // .config("hive.metastore.uris", "thrift://localhost:9083") // uncomment for a remote metastore
      .enableHiveSupport
      .getOrCreate()

    import spark.implicits._
    val testdf = Seq(("Word1", 1), ("Word4", 4), ("Word8", 8)).toDF
    testdf.show()
    testdf.write.mode("overwrite").saveAsTable("WordCount")
  }
}
Upvotes: 2
Reputation: 3062
I too faced the same problem, but resolved it. Just follow these steps in Spark 2.0:
Step 1: Copy the hive-site.xml file from the Hive conf folder to the Spark conf folder.
Step 2: Edit the spark-env.sh file and configure your MySQL driver (if you are using MySQL as the Hive metastore), or add the MySQL driver to Maven/SBT (if using those), e.g. as sketched below.
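A minimal SBT dependency for the MySQL JDBC driver might look like this (the version is an assumption; match it to your MySQL server):
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.49"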
Step 3: When you are creating the Spark session, add enableHiveSupport():
val spark = SparkSession.builder.master("local").appName("testing").enableHiveSupport().getOrCreate()
Sample code:
package sparkSQL
/**
* Created by venuk on 7/12/16.
*/
import org.apache.spark.sql.SparkSession
object hivetable {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local[*]").appName("hivetable").enableHiveSupport().getOrCreate()
spark.sql("create table hivetab (name string, age int, location string) row format delimited fields terminated by ',' stored as textfile")
spark.sql("load data local inpath '/home/hadoop/Desktop/asl' into table hivetab").show()
val x = spark.sql("select * from hivetab")
// saveAsTable's default mode fails if the target table already exists,
// so write to a new (hypothetical) table instead of back to hivetab
x.write.saveAsTable("hivetab_copy")
}
}
Upvotes: 7
Reputation: 543
In Spark 2.0+ it should look something like this:
Don't forget to replace the "hive.metastore.uris" value with yours. This assumes that you already have a Hive metastore service started (not HiveServer2).
val spark = SparkSession
.builder()
.appName("interfacing spark sql to hive metastore without configuration file")
.config("hive.metastore.uris", "thrift://localhost:9083") // replace with your hivemetastore service's thrift url
.enableHiveSupport() // don't forget to enable hive support
.getOrCreate()
import spark.implicits._
import spark.sql
// create an arbitrary frame
val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")
// see the frame created
frame.show()
/**
* +-----+-----+
* | word|count|
* +-----+-----+
* | one| 1|
* | two| 2|
* |three| 3|
* +-----+-----+
*/
// write the frame
frame.write.mode("overwrite").saveAsTable("t4")
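To verify the write, you can read the table back (a minimal check):
// read the saved table back from the metastore and display it
spark.table("t4").show()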
Upvotes: 31