whatsnext

Reputation: 627

How to enable pyspark HIVE support on Google Dataproc master node

I created a Dataproc cluster and manually installed conda and Jupyter Notebook. Then I installed pyspark with conda. I can successfully run Spark with

from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")

However, I cannot enable Hive support. The following code hangs and never returns.

from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config('spark.driver.memory', '2G')
         .config("spark.kryoserializer.buffer.max", "2000m")
         .enableHiveSupport()
         .getOrCreate())

Python version 2.7.13, Spark version 2.3.4

Is there any way to enable Hive support?

Upvotes: 4

Views: 781

Answers (2)

tix

Reputation: 2158

I do not recommend installing pyspark manually. Doing so gives you a new Spark/pyspark installation that is separate from Dataproc's own and does not pick up Dataproc's configuration, tuning, classpath, etc. This is likely why Hive support does not work.
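One way to see which installation a notebook is actually using is to print where pyspark was imported from and what SPARK_HOME points to. This is only a minimal sketch: the helper name describe_pyspark_install is hypothetical, and the paths it reports depend entirely on how the cluster was set up.

```python
import os


def describe_pyspark_install():
    """Report which pyspark package (if any) this interpreter would use."""
    # SPARK_HOME, when set, tells pyspark which Spark distribution to launch.
    info = {"SPARK_HOME": os.environ.get("SPARK_HOME")}
    try:
        import pyspark
    except ImportError:
        # pyspark is not importable from this Python environment.
        info["pyspark_path"] = None
        info["pyspark_version"] = None
    else:
        # Directory of the imported package, e.g. a conda site-packages path.
        info["pyspark_path"] = os.path.dirname(pyspark.__file__)
        info["pyspark_version"] = pyspark.__version__
    return info


info = describe_pyspark_install()
for key in sorted(info):
    print("%s: %s" % (key, info[key]))
```

If pyspark_path points into the conda environment rather than the Spark installation that ships with the Dataproc image, the notebook is probably running the separately installed copy.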

To get conda with a properly configured pyspark, I suggest selecting the ANACONDA and JUPYTER optional components on image 1.3 (the default) or later.

Additionally, on 1.4 and later images Miniconda is the default user Python, with pyspark preconfigured. You can pip/conda install Jupyter on your own if you wish.

See https://cloud.google.com/dataproc/docs/tutorials/python-configuration

Also, as @Jayadeep Jayaraman points out, the Jupyter optional component works with Component Gateway, which means you can open it from a link in the Developers Console instead of opening ports to the world or SSH tunneling.

tl;dr: I recommend these flags for your next cluster: --optional-components ANACONDA,JUPYTER --enable-component-gateway
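As a rough illustration, the flags above could be assembled into a full cluster-creation command like this. The cluster name my-cluster and the helper build_create_command are placeholders, and the exact gcloud command group (for example a beta track) may differ depending on your gcloud release, so treat this as a sketch rather than a definitive invocation.

```python
def build_create_command(cluster_name):
    """Build the gcloud argument list for a cluster with Anaconda and Jupyter."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        # Optional components recommended in the answer.
        "--optional-components", "ANACONDA,JUPYTER",
        # Exposes Jupyter through Component Gateway instead of open ports.
        "--enable-component-gateway",
    ]


# "my-cluster" is a placeholder cluster name.
command = build_create_command("my-cluster")
print(" ".join(command))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.check_call if you prefer to create the cluster from a script.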

Upvotes: 2

Jayadeep Jayaraman

Reputation: 2825

Cloud Dataproc now lets you install optional components on the cluster and provides an easy way to access them through the Component Gateway. Details on installing Jupyter and Conda are here - https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook

Details on the Component Gateway are here - https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways. Note that this feature is in Alpha.

Upvotes: 2
