Mark Morrisson

Reputation: 2703

PySpark on Windows: Hive issues

I'm trying to run LogisticRegressionWithLBFGS from MLlib and I get Hive-related errors:

py4j.protocol.Py4JJavaError: An error occurred while calling o337.trainLogisticRegressionModelWithLBFGS.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;

I never installed Hive, so why does this function depend on it? The documentation doesn't mention it anywhere. Is installing Hive a prerequisite for running any MLlib function?

Upvotes: 0

Views: 471

Answers (1)

OneCricketeer

Reputation: 191904

A Hive installation is not needed, but Spark needs Hive-compatible classes to operate on DataFrame objects, such as those within an ML pipeline step.

Installing with pip install pyspark, for example, doesn't give you these (or any Hadoop) libraries, as far as I know.
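If you'd rather check a given install than take that on trust, you can look for Hive jars in the folder Spark loads its jars from. A rough sketch: find_hive_jars is a made-up helper name, and matching on "hive" in the file name is only a heuristic, not something Spark itself defines.

```python
import os


def find_hive_jars(jars_dir):
    """Return the sorted names of .jar files in jars_dir whose name contains 'hive'.

    Hypothetical helper for illustration; returns an empty list if the
    directory does not exist.
    """
    if not os.path.isdir(jars_dir):
        return []
    return sorted(
        name
        for name in os.listdir(jars_dir)
        if name.lower().endswith(".jar") and "hive" in name.lower()
    )
```

With a downloaded distribution you would call it as find_hive_jars(os.path.join(os.environ["SPARK_HOME"], "jars")), assuming SPARK_HOME is set. An empty result suggests the Hive classes are missing from that install.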

If you downloaded Spark with Hadoop from the Apache site, you get the Hive libraries and a bin/pyspark script. On Windows, though, you might also need to set up WinUtils.
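For the WinUtils part, a quick sanity check before starting Spark can save some guessing. This sketch assumes the usual convention that Hadoop on Windows looks for winutils.exe under the bin folder of HADOOP_HOME; check_winutils is a hypothetical helper, not part of PySpark.

```python
import os


def check_winutils(env=None):
    """Return (ok, message) saying whether HADOOP_HOME points at a winutils.exe.

    Assumes the conventional layout %HADOOP_HOME%\\bin\\winutils.exe.
    env defaults to the current process environment.
    """
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return False, "HADOOP_HOME is not set"
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    if not os.path.isfile(winutils):
        return False, "winutils.exe not found at " + winutils
    return True, "found " + winutils
```

Calling check_winutils() with no arguments inspects the current environment; if it returns False, the message says which piece is missing.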

Upvotes: 1
