Reputation: 1706
I am new to the Hadoop ecosystem and I am still confused about a few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0).
I have some Hive tables that already exist, and I can run SQL queries on them through the HUE web interface with Hive (MapReduce) and Impala (MPP).
I am now using pySpark (I think behind this is pyspark-shell) and I wanted to understand and test HiveContext and SQLContext. There are many threads that discuss the differences between the two for various versions of Spark.
With Hive context, I have no issue to query the Hive tables:
from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc)
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320
So far so good. Since SQLContext is a subset of HiveContext, I was thinking that a basic SQL select should also work:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
FromSQL = sqlSparkContext.sql("select * from table.mytable")
FromSQL.count()
Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;
I added hive-site.xml to the pyspark shell. When running
sc._conf.getAll()
I see:
('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),
My questions are: why does the same query fail with SQLContext even though hive-site.xml is distributed, and is SQLContext supposed to be able to access Hive tables at all?
Thanks a lot
Cheers
Fabien
Upvotes: 1
Views: 19820
Reputation: 31540
Try without wrapping sc in your own sqlContext; use the sqlContext that the pyspark shell already provides. I think when we create the sqlContext object from sc ourselves, Spark needs a HiveContext but we end up with a plain SQLContext instead:
>>>df=sqlContext.sql("select * from <db-name>.<table-name>")
Or use the superset of SQLContext, i.e. HiveContext, to connect and load the Hive tables into Spark DataFrames:
>>>df=HiveContext(sc).sql("select * from <db-name>.<table-name>")
(or)
>>>df=HiveContext(sc).table("default.text_Table")
(or)
>>> hc=HiveContext(sc)
>>> df=hc.sql("select * from default.text_Table")
Upvotes: 0
Reputation: 2853
As mentioned in the other answers, you can't use SQLContext to access Hive tables; Spark 1.x.x provides a separate HiveContext, which is basically an extension of SQLContext.
Reason:
Hive uses an external metastore to keep all the metadata, for example the information about databases and tables. This metastore can be configured to be kept in MySQL etc.; the default is Derby.
This is done so that all users accessing Hive see the same contents, as served by the metastore.
Derby creates a private metastore as a directory metastore_db in the directory from which the Spark app is executed. Since this metastore is private, whatever you create or edit in this session will not be accessible to anyone else. SQLContext basically facilitates a connection to a private metastore.
Needless to say, in Spark 2.x.x the two have been merged into SparkSession, which acts as a single entry point to Spark. You can enable Hive support while creating the SparkSession with .enableHiveSupport().
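For example, a minimal sketch of what that looks like in PySpark 2.x (illustrative only, not from the original post; it assumes Spark binaries built with Hive support, a hive-site.xml visible to the driver, and reuses table.mytable from the question as the example table):
from pyspark.sql import SparkSession

# Illustrative sketch: build a Hive-enabled session in Spark 2.x
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()   # attach the session to the Hive metastore
         .getOrCreate())

df = spark.sql("select * from table.mytable")  # table.mytable is the asker's example table
print(df.count())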
Upvotes: 4
Reputation: 1189
To query a DataFrame through SQLContext, you need to register it as a temporary table first. Then you can easily run SQL queries against it. Suppose you have some data in the form of JSON; you can load it into a DataFrame like below:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
df = sqlSparkContext.read.json("your json data")  # path to your JSON data
df.registerTempTable("mytable")  # registers the DataFrame as a temp table; returns None
FromSQL = sqlSparkContext.sql("select * from mytable")
FromSQL.show()
You can also collect the SQL result as an array of Row objects, as below:
r = FromSQL.collect()
print r[0].column_Name  # access a column of the first Row
Upvotes: 1
Reputation: 330363
You cannot use a standard SQLContext to access Hive directly. To work with Hive you need Spark binaries built with Hive support and a HiveContext.
You could use the JDBC data source, but it won't be acceptable performance-wise for large-scale processing.
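For illustration, a rough sketch of that JDBC route with a plain SQLContext (the HiveServer2 host/port, database, and table names here are hypothetical, and the Hive JDBC driver would have to be on the classpath):
from pyspark.sql import SQLContext

# Hypothetical sketch: pull a Hive table through HiveServer2 over JDBC.
# Every row passes through a single JDBC channel, so this does not scale
# the way a native HiveContext read does.
sqlContext = SQLContext(sc)  # sc is the existing SparkContext

df = (sqlContext.read
      .format("jdbc")
      .options(url="jdbc:hive2://hive-host:10000/default",   # hypothetical HiveServer2 endpoint
               dbtable="mytable",                             # hypothetical table
               driver="org.apache.hive.jdbc.HiveDriver")
      .load())
df.show()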
Upvotes: 1