hiryu

Reputation: 1416

Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

I just upgraded from Spark 2.0.2 to Spark 2.1.0 (by downloading the version prebuilt for Hadoop 2.7 and later). No Hive is installed.

Upon launching the spark-shell, the metastore_db/ folder and the derby.log file are created in the launch directory, together with a number of warning logs that were not printed in the previous version.

Closer inspection of the debug logs shows that Spark 2.1.0 tries to initialise a HiveMetastoreConnection:

17/01/13 09:14:44 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.

Similar debug logs for Spark 2.0.2 do not show any initialisation of HiveMetastoreConnection.

Is this intended behaviour? Could it be related to the fact that spark.sql.warehouse.dir is now a static configuration shared among sessions? How do I avoid this, since I have no Hive installed?

Thanks in advance!

Upvotes: 1

Views: 7004

Answers (3)

Thomas Decaux

Reputation: 22711

This also happens with Spark 1.6. You can change the path by adding an extra Java option to spark-submit:

-Dderby.system.home=/tmp/derby

(or via derby.properties; there are several ways to change it).
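For example, a minimal sketch of passing this option at launch, assuming the standard spark-shell / spark-submit launchers (the /tmp/derby path and my_app.jar are only placeholders):

# point Derby's home (where metastore_db and derby.log are written) to a fixed directory
spark-shell --driver-java-options "-Dderby.system.home=/tmp/derby"

# equivalent form for spark-submit
spark-submit --conf "spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby" my_app.jar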

Upvotes: 0

hiryu

Reputation: 1416

For future googlers: the actual underlying reason for the creation of metastore_db and derby.log in every working directory is the default value of derby.system.home.

This can be changed in spark-defaults.conf; see here.
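As a minimal sketch, the corresponding spark-defaults.conf entry could look like the following (the /tmp/derby path is only an example):

# conf/spark-defaults.conf
# send Derby's metastore_db and derby.log to a fixed location instead of the working directory
spark.driver.extraJavaOptions  -Dderby.system.home=/tmp/derby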

Upvotes: 5

Alexey Svyatkovskiy

Reputation: 646

From the Spark 2.1.0 documentation pages:

When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.

Since you do not have Hive installed, you will not have a hive-site.xml config file, so Spark falls back to the default, which is the current directory.
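If you want to keep the prebuilt binaries, one possible workaround (a sketch, not part of the quoted documentation) is to add a minimal hive-site.xml under Spark's conf/ directory that pins the embedded Derby metastore to a fixed path; the /tmp/spark-metastore path below is just a placeholder:

<configuration>
  <!-- keep the embedded Derby metastore in a fixed location instead of the current directory -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/spark-metastore/metastore_db;create=true</value>
  </property>
</configuration>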

If you are not planning to use the HiveContext in Spark, you could rebuild Spark 2.1.0 from source with Maven, making sure to omit the -Phive and -Phive-thriftserver flags that enable Hive support.
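As a rough sketch (the exact profiles depend on your Hadoop version and environment), the build without Hive support could look like:

# build Spark from source, simply leaving out -Phive and -Phive-thriftserver
./build/mvn -Phadoop-2.7 -Pyarn -DskipTests clean package

# or produce a distributable tarball the same way
./dev/make-distribution.sh --name no-hive --tgz -Phadoop-2.7 -Pyarn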

Upvotes: 5
