Denys

Reputation: 4557

Spark can access Hive table from pyspark but not from spark-submit

So, when running from pyspark I would type in (without specifying any contexts):

df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')

... and it works fine.

However, when I run my script from spark-submit, like

spark-submit script.py

I put the following in:

from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('inc_dd_openings')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df_openings_latest = sqlContext.sql('select * from experian_int_openings_latest_orc')

But it gives me an error

pyspark.sql.utils.AnalysisException: u'Table not found: experian_int_openings_latest_orc;'

So it doesn't see my table.

What am I doing wrong? Please help.

P.S. Spark version is 1.6 running on Amazon EMR

Upvotes: 19

Views: 32899

Answers (4)

Azmat Siddique

Reputation: 21

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()

# enableHiveSupport() is what lets spark.sql() see tables in the Hive metastore
spark = SparkSession. \
    builder. \
    config("spark.ui.port", "0"). \
    config("spark.sql.warehouse.dir", f"/Users/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | python - Processing Column Data'). \
    master('yarn'). \
    getOrCreate()

Upvotes: 0

zero323

Reputation: 330383

Spark 2.x

The same problem may occur in Spark 2.x if SparkSession has been created without enabling Hive support.

Spark 1.x

It is pretty simple. When you use the PySpark shell, and Spark has been built with Hive support, the default SQLContext implementation (the one available as sqlContext) is a HiveContext.

In your standalone application you use a plain SQLContext, which doesn't provide Hive capabilities.

Assuming the rest of the configuration is correct, just replace:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

with

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

Upvotes: 28

Mike Placentra

Reputation: 885

In Spark 2.x (Amazon EMR 5+) you will run into this issue with spark-submit if you don't enable Hive support like this:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("my app").enableHiveSupport().getOrCreate()

Upvotes: 18

Brian

Reputation: 7326

Your problem may be related to your Hive configuration. If your configuration uses a local metastore, the metastore_db directory gets created in the directory that you started your Hive server from.

Since spark-submit is launched from a different directory, it creates a new metastore_db in that directory, which does not contain information about your previous tables.
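The per-directory metastore_db comes from Derby's stock connection URL in hive-site.xml, which uses a relative database path (shown here for context; this is the out-of-the-box Hive default):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

Because databaseName is relative, Derby resolves it against the current working directory of whichever process opens it.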

A quick fix would be to start the Hive server from the same directory as spark-submit and re-create your tables.

A more permanent fix is referenced in this SO post.

You need to change your configuration in $HIVE_HOME/conf/hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/youruser/hive_metadata/metastore_db;create=true</value>
</property>

You should now be able to run hive from any location and still find your tables.

Upvotes: 2
