Ravi Narayanan

Reputation: 31

Start HiveThriftServer programmatically in Python

In the spark-shell (Scala), we import org.apache.spark.sql.hive.thriftserver._ and start the Hive Thrift server programmatically for a particular Hive context with HiveThriftServer2.startWithContext(hiveContext), which exposes the temp tables registered in that session.

How can we do the same using Python? Is there a Python package/API for importing HiveThriftServer? Any other thoughts/recommendations are appreciated.

We have used PySpark to create a DataFrame.

Thanks

Ravi Narayanan

Upvotes: 3

Views: 1909

Answers (2)

bschreiber

Reputation: 66

For Spark 3, the following works:

import sys

from py4j.java_gateway import java_import
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# Make the Scala class visible to py4j.
java_import(sc._jvm, "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2")

# Copy this script's CLI arguments into a Java String[] so they can be
# forwarded to HiveThriftServer2.main.
args = sys.argv[1:]
java_args = sc._gateway.new_array(sc._gateway.jvm.java.lang.String, len(args))
for i, arg in enumerate(args):
    java_args[i] = arg

# main parses the arguments and starts the Thrift server against this session's context.
sc._jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(java_args)

Note that the main method of the HiveThriftServer2 class calls the startWithContext method internally (see the HiveThriftServer2 source code).
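Since main ultimately delegates to startWithContext, you could also try calling it directly. A minimal, untested sketch, assuming startWithContext still accepts a SQLContext in your Spark 3 build and that PySpark's private _jsparkSession attribute is available:

# Untested sketch: pass the Java SQLContext backing this PySpark session directly.
# _jsparkSession is a private PySpark attribute and may change between versions.
jsql_context = spark._jsparkSession.sqlContext()
sc._jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(jsql_context)

Forwarding sys.argv through main, as above, has the advantage that server options such as --hiveconf hive.server2.thrift.port=10001 can be passed on the command line.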

Upvotes: 0

Sasinda Rukshan

Reputation: 449

You can import it using the py4j Java gateway. The following code worked on Spark 2.0.2, and temp tables registered in the Python script could then be queried through beeline.

from py4j.java_gateway import java_import
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName(app_name) \
        .master(master) \
        .enableHiveSupport() \
        .config('spark.sql.hive.thriftServer.singleSession', True) \
        .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('INFO')

# Make the HiveThriftServer2 class visible to py4j.
java_import(sc._gateway.jvm, "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2")

# Start the Thrift server in the JVM, passing the Java SQLContext that backs
# this PySpark session so both sides share the same session state.
sc._gateway.jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark._jwrapped)

# Example permanent table (the schema here is illustrative; any will do).
spark.sql('CREATE TABLE myTable (id INT, value STRING)')

data_file = "path to csv file with data"
dataframe = spark.read.option("header", "true").csv(data_file).cache()
dataframe.createOrReplaceTempView("myTempView")

Then go to beeline to check that it started correctly:

in terminal> $SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
beeline> show tables;

It should list the tables and temp tables/views created in Python, including "myTable" and "myTempView" above. The Thrift server must share the same Spark session for temporary views to be visible (see the answer to "Avoid starting HiveThriftServer2 with created context programmatically").

NOTE: Hive tables are accessible even if the Thrift server is started from the terminal, as long as it connects to the same metastore. Temp views, however, cannot be accessed that way, because they live in the Spark session and are never written to the metastore.
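If you prefer to verify from Python rather than beeline, a minimal sketch using the third-party PyHive package (an assumption; any Hive Thrift/JDBC client would work) could look like this:

from pyhive import hive  # third-party package, not part of PySpark

# Connect to the Thrift server started above (default port 10000).
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())  # should include myTable and myTempView
cursor.execute("SELECT * FROM myTempView LIMIT 5")
print(cursor.fetchall())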

Upvotes: 5
