Reputation: 2051
What's the difference between Spark using the Hive metastore and Spark running as Hive's execution engine? I followed THIS TUTORIAL to configure Spark and Hive, and I have successfully created, populated and analysed data from a Hive table. What confuses me now is: what have I actually done?
a) Did I configure Spark to use the Hive metastore and analyse the data in the Hive table using SparkSQL?
b) Or did I actually use Spark as Hive's execution engine and analyse the data in the Hive table using HiveQL, which is what I want to do?
I will try to summarize what I have done to configure Spark and Hive:
a) I followed the above tutorial and configured Spark and Hive.
b) I wrote my /conf/hive-site.xml like this, and
c) after that I wrote some code that connects to the Hive metastore and does my analysis. I am using Java for this, and this piece of code starts the Spark session:
SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL basic example")
        .enableHiveSupport()
        .config("spark.sql.warehouse.dir", "hdfs://saurab:9000/user/hive/warehouse")
        .config("mapred.input.dir.recursive", true)
        .config("hive.mapred.supports.subdirectories", true)
        .config("spark.sql.hive.thriftServer.singleSession", true)
        .master("local")
        .getOrCreate();
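To check (I'm not sure this is conclusive) whether that session is really talking to the Hive metastore and not to a fresh local one, I just list the databases it can see:

// If this prints the databases already registered in the Hive metastore
// (not only "default"), the session is using the metastore from hive-site.xml.
spark.sql("SHOW DATABASES").show();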
And this piece of code will create the database and the table. Here db = mydb and table1 = mytbl:
String query = "CREATE DATABASE IF NOT EXISTS " + db;
spark.sql(query);

// reuse the variable (re-declaring "String query" here would not compile)
query = "CREATE EXTERNAL TABLE IF NOT EXISTS " + db + "." + table1
        + " (icode String, " +
        "bill_date String, " +
        "total_amount float, " +
        "bill_no String, " +
        "customer_code String) " +
        "COMMENT \" Sales details \" " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY \",\" " +
        "LINES TERMINATED BY \"\\n\" " +
        "STORED AS TEXTFILE " +
        "LOCATION 'hdfs://saurab:9000/ekbana2/' " +
        "tblproperties(\"skip.header.line.count\"=\"1\")";
spark.sql(query);
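After that, this is roughly how I read the data back (just a sketch; the column names are the ones from the CREATE TABLE above):

// Query the Hive table through SparkSQL and print a few rows
spark.sql("SELECT bill_no, total_amount FROM " + db + "." + table1).show(10);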
Then I create a jar and run it using spark-submit:
./bin/spark-submit --master yarn --jars jars/datanucleus-api-jdo-3.2.6.jar,jars/datanucleus-core-3.2.10.jar,jars/datanucleus-rdbms-3.2.9.jar,/home/saurab/hadoopec/hive/lib/mysql-connector-java-5.1.38.jar --verbose --properties-file /home/saurab/hadoopec/spark/conf/spark-env.sh --files /home/saurab/hadoopec/spark/conf/hive-site.xml --class HiveRead /home/saurab/sparkProjects/spark_hive/target/myJar-jar-with-dependencies.jar
Doing this I get what I want, but I am not sure I am doing what I really want to do. My question might seem difficult to understand because I don't know how to explain it; if so, please comment and I will try to expand on it.
Also, if there is any tutorial that focuses on how Spark and Hive work together, please provide a link. I also want to know whether Spark reads spark/conf/hive-site.xml or hive/conf/hive-site.xml, because I am confused about where to set hive.execution.engine=spark.
Thanks
Upvotes: 1
Views: 2299
Reputation: 1072
It seems like you're doing two opposite things at once. The tutorial you linked to gives instructions for using Spark as Hive's execution engine (what you described as option b). This means you run your Hive queries almost exactly as before, but behind the scenes Hive uses Spark instead of classic MapReduce. In that case you don't need to write any Java code that uses SparkSession etc. The code you wrote is doing what you described in option a: using Spark to run Hive queries against the Hive metastore.
So in summary, you don't need to do both. Either follow the tutorial and configure Spark as your Hive execution engine (of course this still requires installing Spark etc.), OR write Spark code that executes Hive queries.
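For reference, the switch for option (b) lives in Hive's own configuration (hive/conf/hive-site.xml), not in Spark's conf directory. A minimal sketch (the rest of your Hive-on-Spark setup, Spark jars, paths and so on, is whatever the tutorial gave you):

<!-- hive/conf/hive-site.xml: make Hive submit its queries to Spark -->
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>

You can also try it per session from the Hive CLI or beeline with set hive.execution.engine=spark; before running a query.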
Upvotes: 1