Saurab

Reputation: 2051

Hive on Spark and Spark as Hive execution engine: what's the difference?

What is the difference between Spark using the Hive metastore and Spark running as Hive's execution engine? I followed this tutorial to configure Spark and Hive, and I have successfully created, populated, and analysed data from a Hive table. What confuses me is what exactly I have done:

a) Did I configure Spark to use the Hive metastore and analyse data in the Hive table using Spark SQL?
b) Did I actually use Spark as the Hive execution engine and analyse data in the Hive table using HiveQL, which is what I want to do?

Here is a summary of what I did to configure Spark and Hive:

a) I followed the tutorial above and configured Spark and Hive.
b) I wrote my /conf/hive-site.xml like this.
c) I then wrote some code that connects to the Hive metastore and does my analysis. I am using Java, and this piece of code starts the Spark session:

SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "hdfs://saurab:9000/user/hive/warehouse")
                .config("mapred.input.dir.recursive", true)
                .config("hive.mapred.supports.subdirectories", true)
                .config("spark.sql.hive.thriftServer.singleSession", true)
                .master("local")
                .getOrCreate();

This piece of code creates the database and the table. Here db=mydb and table1=mytbl.

// Create the database if it does not exist yet.
String createDb = "CREATE DATABASE IF NOT EXISTS " + db;
spark.sql(createDb);

// Create an external table over the CSV files already stored in HDFS.
// A second variable name is used because `query` cannot be declared twice in one scope.
String createTable = "CREATE EXTERNAL TABLE IF NOT EXISTS " + db + "." + table1
                + " (icode String, " +
                "bill_date String, " +
                "total_amount float, " +
                "bill_no String, " +
                "customer_code String) " +
                "COMMENT \" Sales details \" " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY \",\" " +
                "LINES TERMINATED BY  \"\n\" " +
                "STORED AS TEXTFILE " +
                "LOCATION 'hdfs://saurab:9000/ekbana2/' " +
                "tblproperties(\"skip.header.line.count\"=\"1\")";
spark.sql(createTable);

Then I build a jar and run it with spark-submit:

./bin/spark-submit --master yarn  --jars jars/datanucleus-api-jdo-3.2.6.jar,jars/datanucleus-core-3.2.10.jar,jars/datanucleus-rdbms-3.2.9.jar,/home/saurab/hadoopec/hive/lib/mysql-connector-java-5.1.38.jar --verbose --properties-file /home/saurab/hadoopec/spark/conf/spark-env.sh --files /home/saurab/hadoopec/spark/conf/hive-site.xml --class HiveRead  /home/saurab/sparkProjects/spark_hive/target/myJar-jar-with-dependencies.jar 

Doing this I get the result I want, but I am not sure I am doing it the way I intended. My question may be hard to follow because I don't know how to explain it. If so, please comment and I will try to expand it.

Also, if there is a tutorial that focuses on Spark and Hive working together, please share a link. I would also like to know whether Spark reads spark/conf/hive-site.xml or hive/conf/hive-site.xml, because I am confused about where to set hive.execution.engine=spark. Thanks.

Upvotes: 1

Views: 2299

Answers (1)

Dean Gurvitz

Reputation: 1072

It seems like you're doing two opposite things at once. The tutorial you linked to gives instructions for using Spark as Hive's execution engine (what you described as option b). This means that you run your Hive queries almost exactly as before, but behind the scenes Hive uses Spark instead of classic MapReduce. In that case you don't need to write any Java code that uses SparkSession, etc. The code you wrote does what you described in option a: it uses Spark to run Hive queries and uses the Hive metastore.

So in summary, you don't need to do both. Either use the tutorial to configure Spark as your Hive execution engine (of course this still requires installing Spark, etc.), OR write Spark code that executes Hive queries.
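To make option b concrete, here is a minimal sketch of running HiveQL through Hive itself from Java, with Spark selected as the engine for that session. It assumes a HiveServer2 instance on localhost:10000, the Hive JDBC driver on the classpath, and the mydb database and mytbl table from the question; the class and helper names are made up for illustration. Note that hive.execution.engine is a Hive setting, so it takes effect in the Hive session (or in Hive's own hive-site.xml), not in SparkSession code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch: the query goes to Hive, and Hive decides how to execute it.
final class HiveOnSparkSketch {

    // Builds a HiveServer2 JDBC URL; host, port and database are placeholders.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Builds the session-level statement that selects Hive's execution engine.
    static String engineStatement(String engine) {
        return "SET hive.execution.engine=" + engine;
    }

    public static void main(String[] args) {
        // Assumed address of a running HiveServer2; adjust to your cluster.
        String url = jdbcUrl("localhost", 10000, "mydb");
        System.out.println("Connecting to " + url);

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // Ask Hive to run the following queries on Spark instead of MapReduce.
            stmt.execute(engineStatement("spark"));

            // Plain HiveQL; no SparkSession is involved on the client side.
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM mytbl")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        } catch (SQLException e) {
            // Reached when no driver is on the classpath or the server is unreachable.
            System.err.println("Could not query HiveServer2: " + e.getMessage());
        }
    }
}
```

By contrast, the SparkSession code in the question never goes through Hive's query engine: Spark SQL parses and runs the query itself and only consults the Hive metastore for table definitions.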

Upvotes: 1
