WestCoastProjects

Reputation: 63022

Unable to run pyspark 2.X due to hive metastore connectivity issues

When running pyspark 1.6.X, it comes up just fine.

17/02/25 17:35:41 INFO storage.BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.13 (default, Dec 17 2016 23:03:43)
SparkContext available as sc, SQLContext available as sqlContext.
>>>

But after I reset SPARK_HOME, PYTHONPATH, and PATH to point to the Spark 2.x installation, things go south quickly:

(a) I have to manually delete the Derby metastore_db directory each time.

(b) pyspark does not launch: it hangs after printing these unhappy warnings:

[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/25 17:32:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/25 17:32:53 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/25 17:32:53 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException

I do not need or care about Hive capabilities, but it may well be that they are required by Spark 2.X. What is the simplest working Hive configuration to make pyspark 2.X happy?

Upvotes: 1

Views: 1282

Answers (1)

santon

Reputation: 4605

Have you tried the enableHiveSupport function? I had issues with DataFrames when migrating from 1.6 to 2.x, even when I wasn't accessing Hive. Calling that function on the builder solved my problem. (You can also add it to the config.)
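
For reference, here is a minimal sketch of what that looks like if you build the session yourself instead of relying on the shell-provided context (the app name is just a placeholder; the rest is the standard SparkSession builder API):

from pyspark.sql import SparkSession

# Build a session with Hive support enabled; the SparkContext is still
# available afterwards via spark.sparkContext if you need it.
spark = (SparkSession.builder
         .appName("pyspark-2x-test")   # placeholder name
         .enableHiveSupport()          # use the Hive catalog instead of in-memory
         .getOrCreate())

# Quick sanity check that DataFrame operations work
spark.range(5).show()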

If you're using the pyspark shell to provision your Spark context, you'll need to enable Hive support via the config instead. In your spark-defaults.conf, try adding spark.sql.catalogImplementation hive.
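
As a sketch, that entry in conf/spark-defaults.conf would look like this (the default value of this property is "in-memory"; setting it to "hive" is what enables Hive support):

# conf/spark-defaults.conf
spark.sql.catalogImplementation  hive

Or, equivalently, you can pass the same property per invocation:

pyspark --conf spark.sql.catalogImplementation=hive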

Upvotes: 2
