I am using cloudera-quickstart-vm-5.12.0-0-virtualbox-disk1 for my Big Data practice.
I am trying to integrate Spark and Hive using Scala code. The Scala code is written in Eclipse on Windows; once it is written I build a jar, copy it to my Cloudera cluster, and run it from there with spark-submit.
Spark Logic -
package sparkhiveintegration_package_spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object objSparkHiveIntegration {
  def main(args: Array[String]): Unit = {
    // Point Spark at the Hive warehouse directory on HDFS.
    val warehouseLocation = "/user/hive/warehouse/"

    val sc = new SparkConf()
      .setAppName("SparkHiveIntegrationTest")
      .setMaster("local[*]")
      .set("spark.sql.warehouse.dir", warehouseLocation)
    val sctxt = new SparkContext(sc)
    sctxt.setLogLevel("ERROR")

    // Hive-enabled session built on the same configuration.
    val ssc = SparkSession.builder()
      .config(sc)
      .enableHiveSupport()
      .getOrCreate()
    import ssc.implicits._

    ssc.sql("use practise")
    val sprk_hve_read_df = ssc.sql("select * from customers")
    sprk_hve_read_df.show()

    //sprk_hve_read_df.write.format("csv").option("header","true").mode("overwrite")
    //  .save("user/cloudera/sprk_hive_integration/")
  }
}
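As an aside, in Spark 2.x the SparkSession alone is usually enough; the explicit SparkContext is redundant, since the session exposes one via ssc.sparkContext. A minimal sketch of the same job written that way (the object name objSparkHiveIntegration2 is invented for illustration; the warehouse path, database and table are the ones from the code above):

package sparkhiveintegration_package_spark

import org.apache.spark.sql.SparkSession

object objSparkHiveIntegration2 {
  def main(args: Array[String]): Unit = {
    // One Hive-enabled session; no explicit SparkContext needed.
    val spark = SparkSession.builder()
      .appName("SparkHiveIntegrationTest")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
      .enableHiveSupport()
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Fully qualifying the table avoids depending on a prior "use practise".
    spark.sql("select * from practise.customers").show()
  }
}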
Steps I took -
In Eclipse I added the Maven dependencies to pom.xml as described in https://sparkbyexamples.com/spark/how-to-connect-spark-to-remote-hive/
My Spark version is 2.3.1. To add the dependencies in Eclipse I followed the steps in https://learnjava.co.in/how-to-add-maven-dependencies-via-eclipse/
I then copied hive-site.xml, core-site.xml and hdfs-site.xml to the Spark conf directory, which is /usr/local/spark/conf/
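One thing worth knowing here: if hive-site.xml is not actually visible to Spark at runtime, a Hive-enabled session falls back to an embedded Derby metastore in the working directory, which only contains the default database, and any other database will come back as "not found". A quick diagnostic sketch that can be dropped into the job right after the session is created (ssc is the session from the code above):

// "hive" means Hive support is active; "in-memory" means it never was.
println(ssc.conf.get("spark.sql.catalogImplementation"))

// If this lists only "default", the session is not talking to the
// cluster's real metastore, i.e. hive-site.xml was not picked up.
ssc.catalog.listDatabases().show(false)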
What I have done so far: I created a database called practise from the hive shell using create database practise, and so on.
I then used a Sqoop job to move a customer table CSV file into Hive, so the customers table now lives in the practise database. The table path is /user/hive/warehouse/practise.db/customers and the data file is /user/hive/warehouse/practise.db/customers/customer.csv
When I run the SQL statement select * from practise.customers limit 10; in the Hive shell,
it retrieves the data correctly.
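The same check can be mirrored from the Spark side; a small sketch, again assuming ssc is the session from the code above:

// Both of these should succeed once the session sees the real metastore.
println(ssc.catalog.tableExists("practise", "customers"))   // expect: true
ssc.sql("select * from practise.customers limit 10").show()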
So I want the code above to retrieve the data; then I want to perform some transformations, actions, joins etc. and finally write the result in Avro format to some HDFS location.
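On the Avro part: Spark 2.3.x has no built-in Avro source (the built-in "avro" format only arrived in Spark 2.4), so the job would need the external spark-avro package, e.g. --packages com.databricks:spark-avro_2.11:4.0.0 on spark-submit. A hedged sketch of the write, with the output path invented for illustration:

// Requires the com.databricks:spark-avro package on the classpath;
// the output directory below is just an example.
sprk_hve_read_df.write
  .format("com.databricks.spark.avro")
  .mode("overwrite")
  .save("/user/cloudera/sprk_hive_integration_avro/")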
My spark-submit command -

spark-submit --master local[*] --class sparkhiveintegration_package_spark.objSparkHiveIntegration /home/cloudera/externalJars/SparkRDDPractiseSession-0.0.1-SNAPSHOT.jar
When I run this command I get errors like:
table or view not found.
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'practise' not found;