Reputation: 84
I have a strange issue when I try to use PySpark DataFrames or SQL. While everything works in an IPython notebook or a plain Python console, I get the "javax.jdo.JDOFatalInternalException: Error creating transactional connection factory" error when I run the same code in the pyspark shell.
In short, everything works if I run the following in an IPython notebook or a plain Python terminal:
import findspark
findspark.init(r"C:\Spark\spark-2.3.3-bin-hadoop2.7")
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql('''select 'spark' as hello ''').show()
When I open just the pyspark shell by typing 'pyspark' (where a SparkSession is already initialized) and execute this:
spark.sql('''select 'spark' as hello ''').show()
I get the following error:
>>> spark.sql('''select 'spark' as hello ''').show()
2019-05-12 18:41:35 WARN HiveMetaStore:622 - Retrying creating default database after error: Error creating transactional connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
...
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
It's very strange. Any idea why it works in one setting but not the other? Thanks!
Edit: More of the error:
java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'root'@'localhost' (using password: YES)
Upvotes: 0
Views: 990
Reputation: 84
I got it to work. When starting Spark, there are two options for the "spark.sql.catalogImplementation" setting: hive or in-memory. I am using Windows and have had a headache getting Hive to work with PySpark. For some reason, the Jupyter notebook running PySpark doesn't apply the Hive catalog setting (which is why it worked there), whereas the interactive pyspark shell was running with the default value spark.sql.catalogImplementation=hive. If you want to avoid the Hive headaches, simply pass the parameter at launch time like this:
pyspark --conf spark.sql.catalogImplementation=in-memory
Then run this line to test it works:
spark.sql('''select 'spark' as hello ''').show()
If that runs, then everything is working fine.
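If you build the session yourself (for example in a notebook) instead of relying on the shell's pre-built one, you can set the same option on the SparkSession builder. This is just a minimal sketch of that approach; the findspark path and the app name are assumptions based on my setup, so adjust them to yours:

import findspark
findspark.init(r"C:\Spark\spark-2.3.3-bin-hadoop2.7")  # path to your local Spark install (assumption)

from pyspark.sql import SparkSession

# Build a session that uses the in-memory catalog instead of Hive.
spark = (SparkSession.builder
         .appName("no-hive-demo")
         .config("spark.sql.catalogImplementation", "in-memory")
         .getOrCreate())

spark.sql('''select 'spark' as hello ''').show()

Note that if a SparkSession already exists in the process, getOrCreate() returns that existing session and the config may not take effect, so set this before any other Spark code runs.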
If you want to make that setting the default, go to your Spark directory, edit the file conf/spark-defaults.conf, and add the line spark.sql.catalogImplementation=in-memory. The file will probably only exist as a .template file initially, so make sure to save your copy as a .conf file. After that, every time you start pyspark, you should have no problems with Hive.
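For concreteness, this is roughly what the relevant part of conf/spark-defaults.conf would contain after the edit (the comment line is only my addition for readability):

# Use the in-memory catalog so the Hive metastore is never touched
spark.sql.catalogImplementation=in-memory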
Another way to check is to open the Spark UI when your pyspark session starts and look at the Environment page (http://localhost:4041/environment/). Under 'Spark Properties' you can see what value spark.sql.catalogImplementation has (you can also inspect that value from the interactive shell, as shown below).
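To check from the shell instead of the UI, you can read the setting through spark.conf; this should return 'in-memory' if the flag took effect:

>>> spark.conf.get("spark.sql.catalogImplementation")
'in-memory'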
Again, I am simply running PySpark locally on my Windows machine, but now PySpark with DataFrame support works seamlessly in both Jupyter notebooks and the interactive shell!
Upvotes: 1