Dr. Fabien Tarrade

Reputation: 1706

pyspark, how to read Hive tables with SQLContext?

I am new to the Hadoop ecosystem and I am still confused about a few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0).

I have some Hive tables that already exist, and I can run SQL queries against them through the Hue web interface with Hive (MapReduce) and Impala (MPP).

I am now using pySpark (I think behind this is pyspark-shell) and I wanted to understand and test HiveContext and SQLContext. There are many threads that discuss the differences between the two, for various versions of Spark.

With HiveContext, I have no issue querying the Hive tables:

from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc)  # sc is the SparkContext created by the pyspark shell
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320

So far so good. Since SQLContext is a subset of HiveContext, I was thinking that a basic SQL select should work as well:

from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
FromSQL = sqlSparkContext.sql("select * from table.mytable")
FromSQL.count()

Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;

I added the hive-site.xml to the pyspark-shell configuration. When running

sc._conf.getAll()

I see:

('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),

My questions are: why does SQLContext not find my Hive tables here, and is it possible to read Hive tables with a plain SQLContext at all?

Thanks a lot

Cheers

Fabien

Upvotes: 1

Views: 19820

Answers (4)

notNull

Reputation: 31540

Try without keeping sc in the sqlContext. I think that when we create the sqlContext object with sc, Spark tries to call HiveContext, but we have a sqlContext instead:

>>> df = sqlContext.sql("select * from <db-name>.<table-name>")

Use the superset of SQLContext, i.e. HiveContext, to connect to and load the Hive tables into Spark DataFrames:

>>> df = HiveContext(sc).sql("select * from <db-name>.<table-name>")

(or)

>>> df = HiveContext(sc).table("default.text_Table")

(or)

>>> hc = HiveContext(sc)
>>> df = hc.sql("select * from default.text_Table")

Upvotes: 0

Chitral Verma

Reputation: 2853

As mentioned in the other answers, you can't use SQLContext to access Hive tables; Spark 1.x.x provides a separate HiveContext, which is basically an extension of SQLContext.

Reason:

Hive uses an external metastore to keep all the metadata, for example the information about databases and tables. This metastore can be configured to be kept in MySQL etc.; the default is Derby. This is done so that all the users accessing Hive see the same contents, as served by the metastore. Derby creates a private metastore, as a metastore_db directory in the directory from which the Spark app is executed. Since this metastore is private, whatever you create or edit in this session will not be accessible to anyone else. SQLContext basically facilitates a connection to such a private metastore, as the sketch below illustrates.
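A minimal sketch of that difference, assuming the Spark 1.6 pyspark shell from the question (the database name "table" is the one the question uses):

from pyspark.sql import SQLContext, HiveContext

# A plain SQLContext only sees temp tables registered in this session
sqlCtx = SQLContext(sc)
print sqlCtx.tableNames()          # empty unless you registered something yourself

# A HiveContext talks to the shared Hive metastore
hiveCtx = HiveContext(sc)
print hiveCtx.tableNames("table")  # lists the tables of the "table" database, e.g. mytable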

Needless to say, in Spark 2.x.x the two have been merged into SparkSession, which acts as the single entry point to Spark. You can enable Hive support while creating the SparkSession with .enableHiveSupport().
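For example (a minimal Spark 2.x sketch; the app name is arbitrary, and the table is the one from the question):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-example")   # arbitrary name, just for illustration
         .enableHiveSupport()       # wires the session to the Hive metastore
         .getOrCreate())

spark.sql("select * from table.mytable").count()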

Upvotes: 4

hitttt

Reputation: 1189

To query tables with SQLContext, you first need to register your DataFrame as a temporary table; then you can easily run SQL queries on it. Suppose you have some data in the form of JSON; you can load it into a DataFrame.

Like below:

from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
df = sqlSparkContext.read.json("your json data")  # path to your JSON file
df.registerTempTable("mytable")  # returns None, so don't assign the result
FromSQL = sqlSparkContext.sql("select * from mytable")
FromSQL.show()

You can also collect the SQL result as an array of Row objects, as below:

r = FromSQL.collect()   # a list of Row objects
print r[0].column_Name  # access a column of the first row by name

Upvotes: 1

zero323

Reputation: 330363

You cannot use standard SQLContext to access Hive directly. To work with Hive you need Spark binaries built with Hive support and HiveContext.

You could use the JDBC data source, but it won't be acceptable performance-wise for large-scale processing.
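A hedged sketch of what that could look like with the Spark 1.6 API (the HiveServer2 URL is a placeholder, and having the Hive JDBC driver on the classpath is an assumption, not something this answer specifies):

# Read a Hive table over JDBC; host, port and driver availability are assumed
df = sqlContext.read.format("jdbc").options(
    url="jdbc:hive2://hiveserver-host:10000/default",  # hypothetical HiveServer2 endpoint
    driver="org.apache.hive.jdbc.HiveDriver",
    dbtable="mytable",
).load()
df.count()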

Upvotes: 1
