skybluelee

Reputation: 13

How can I change the location of the default database for the warehouse? (Spark)

    ...
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>hdfs://spark-master-01:9000/skybluelee/skybluelee_warehouse_mysql_5.7</value>
        <description>location of default database for the warehouse</description>
    </property>
    ...

This snippet is part of /user/spark3/conf/hive-site.xml.

At first, the value was:

hdfs://spark-master-01:9000/kikang/skybluelee_warehouse_mysql_5.7

Then I changed the value to:

hdfs://spark-master-01:9000/skybluelee/skybluelee_warehouse_mysql_5.7

Below are the code and its result:

println(spark.conf.get("spark.sql.warehouse.dir"))  //--Default : spark-warehouse

spark
    .sql("""
        SELECT 
            website, 
            avg(age) avg_age, 
            max(id) max_id
        FROM 
            people a 
            JOIN 
            projects b 
            ON a.name = b.manager 
        WHERE 
            a.age > 11 
        GROUP BY 
            b.website
        """)
    .write
    .mode("overwrite")  //--Overwrite mode....
    .saveAsTable("JoinedPeople")  //--saveAsTable(<warehouse_table_name>)....
    
sql("SELECT * FROM JoinedPeople").show(1000)

hdfs://spark-master-01:9000/skybluelee/skybluelee_warehouse_mysql_5.7
+--------------------+-------+------+
|             website|avg_age|max_id|
+--------------------+-------+------+
|http://hive.apach...|   30.0|     2|
|http://kafka.apac...|   19.0|     3|
|http://storm.apac...|   30.0|     9|
+--------------------+-------+------+

The value of 'spark.sql.warehouse.dir' was changed from kikang to skybluelee, as I wanted.

But the location of the table "JoinedPeople" doesn't change: it is still 'hdfs://spark-master-01:9000/kikang/skybluelee_warehouse_mysql_5.7', the first value from hive-site.xml.

I want to change the location of the default database. How can I change the default location?

I also changed 'spark-defaults.conf' and, of course, restarted the Ubuntu machine, but that wasn't effective either.
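For reference, the line I edited in spark-defaults.conf looked something like this (the property name is the one Spark actually reads; the HDFS path simply mirrors the one from hive-site.xml):

```
# $SPARK_HOME/conf/spark-defaults.conf
spark.sql.warehouse.dir    hdfs://spark-master-01:9000/skybluelee/skybluelee_warehouse_mysql_5.7
```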

Upvotes: 0

Views: 1034

Answers (2)

skybluelee

Reputation: 13

I found what I missed!

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://spark-worker-01:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>metadata is stored in a MySQL server</description>
</property>

This property points the metastore at MySQL 5.7. The metastore_db was created on my very first attempt, so even after I changed the warehouse location in hive-site.xml, the existing metastore still kept the old location.
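In other words, the metastore records the location at the time the database is created, and editing hive-site.xml afterwards does not update it. A sketch of how one could inspect and re-point it from Spark SQL (assuming Spark 3.0+, where `ALTER DATABASE ... SET LOCATION` is available; note it only affects tables created afterwards):

```sql
-- Show the location currently recorded for the default database
DESCRIBE DATABASE EXTENDED default;

-- Re-point the default database at the new warehouse path
-- (existing tables keep their old locations)
ALTER DATABASE default SET LOCATION
  'hdfs://spark-master-01:9000/skybluelee/skybluelee_warehouse_mysql_5.7';
```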

Upvotes: 1

Scott Hsieh

Reputation: 1493

Could you check which Spark version you're using in this scenario? According to the Hive Tables page in the official Spark documentation:

Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.

  1. Does changing the property in hive-site.xml work for you (assuming the Spark version is above 2.0.0)?

     ...
     <property>
         <name>spark.sql.warehouse.dir</name>
         <value>hdfs://spark-master-01:9000/skybluelee/skybluelee_warehouse_mysql_5.7</value>
         <description>location of default database for the warehouse</description>
     </property>
     ...
    
  2. Does setting the property before the initialization of a Spark session work for you?

    import org.apache.spark.sql.SparkSession
    
    val warehouseLocation = "hdfs://spark-master-01:9000/skybluelee/skybluelee_warehouse_mysql_5.7"
    
    // Create a SparkSession with the desired warehouse location
    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()
    
    // Import the necessary Spark functions and implicit
    import spark.implicits._
    import spark.sql
    sql("""
     SELECT 
         website, 
         avg(age) avg_age, 
         max(id) max_id
     FROM 
         people a 
         JOIN 
         projects b 
         ON a.name = b.manager 
     WHERE 
         a.age > 11 
     GROUP BY 
         b.website
     """)
       .write
       .mode("overwrite")
       .saveAsTable("JoinedPeople")
    
    // Retrieve the location of the "JoinedPeople" table from the Hive metastore
    val tableLocation = spark.sql("DESCRIBE EXTENDED JoinedPeople")
      .filter($"col_name" === "Location")
      .select("data_type")
      .collect()(0)(0)
    println(s"Table location: $tableLocation")

Upvotes: 0
