Reputation: 27455
I have registered a temporary table with Spark SQL, as described in [this section]:
people.registerTempTable("people")
// I can run queries on it all right.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Now I want to access this table remotely through JDBC. I start up the Thrift Server as described in [this other section].
./sbin/start-thriftserver.sh --master spark://same-master-as-above:7077
But the table is not visible.
0: jdbc:hive2://localhost:10000> show tables;
+---------+
| result |
+---------+
+---------+
No rows selected (2.216 seconds)
I guess this is because the table is "temporary" (i.e. tied to the lifetime of the SqlContext object). But how do I make non-temporary tables?
I can see Hive tables through the Thrift Server, but I don't see how I could expose an RDD like this. I've found a comment that suggests I cannot.
Or should I run the Thrift Server in my application with my own SqlContext? Almost all classes around it are private, and this code is not in Maven Central (as far as I can see). Am I supposed to use HiveThriftServer2.startWithContext? It's undocumented and @DeveloperApi, but might work.
Upvotes: 19
Views: 13772
Reputation: 56
Modify spark-defaults.conf and add spark.sql.hive.thriftServer.singleSession true.
This allows the Thrift Server to see temp tables based directly on an RDD without having to save the table. You can also do CACHE TABLE XXX AS <query> in Spark SQL and have it exposed via ODBC/JDBC.
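For example (my sketch, not part of the original answer, and assuming Spark 1.6 or later where this setting exists), the same setting can be passed when starting the Thrift Server instead of editing spark-defaults.conf:
./sbin/start-thriftserver.sh --master spark://same-master-as-above:7077 \
  --conf spark.sql.hive.thriftServer.singleSession=true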
Upvotes: 4
Reputation: 5405
The general idea behind a temp table, at least from the original RDBMS intention, is that it has very limited scope (usually the user session) and goes out of scope and is cleaned up when the session ends.
If you create a temp table in one session, then another session that connects via JDBC or ODBC to the thriftserver is not going to be able to see it, and that's by design. If you need a table to persist beyond a session then it's not temporary by definition.
The solution suggested, to "enable spark.sql.hive.thriftServer.singleSession in order to share them with your SQL clients", does work, but it kind of defeats the purpose of having a multi-session environment. Setting single session effectively forces temp tables to behave as persistent tables, except that they will still go out of context if the Hive metastore restarts.
It's easy enough to use the SqlContext to CREATE TABLE, or to use saveAsTable(); see: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#saving-to-persistent-tables
EDIT: It appears saveAsTable was first mentioned in the Spark 1.3 documentation, but you can use the inline SQL method if you are on an earlier version.
people.write.saveAsTable("people")
That will persist the people table in your default database. Better to create a database first and specify it:
people.write.saveAsTable("databaseName.people")
or with inline SQL
sqlContext.sql("CREATE TABLE IF NOT EXISTS people
as SELECT name FROM people WHERE age >= 13 AND age <= 19")
Both of those methods will register the persistent table in the Hive metastore. saveAsTable also has options for whether you want to append or overwrite.
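As a quick sketch of those options (mine, not the answerer's; it assumes the Spark 1.4+ DataFrameWriter API and the same `people` DataFrame as above):
import org.apache.spark.sql.SaveMode

// Overwrite any existing contents of databaseName.people; use SaveMode.Append to add rows instead.
people.write.mode(SaveMode.Overwrite).saveAsTable("databaseName.people")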
Upvotes: 3
Reputation: 403
Most probably you have already solved this problem. But I tried a similar use case recently and wanted to share my findings. To expose Spark data over JDBC, you need to do the following:
Start the Thrift Server that comes with Spark (I used version 1.3.1); the Thrift Server that comes with Hive might work as well, but I haven't tested that.
/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001
Make sure you give the actual IP address/DNS name of the server (not "localhost") for "hive.server2.thrift.bind.host"; otherwise you can't connect to this Thrift Server from a different host.
SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://spark-master:10001
You can use any HiveQL here to load the data you want into a table, but the SQL will be executed in the Spark cluster:
create temporary table tweets using org.apache.spark.sql.parquet options (path 'maprfs:///path-to-your-parquet-files/');
Cache the table in Spark memory:
cache table tweets
That's it: all your data is now cached in the Spark cluster, and you can query it with low latency over remote JDBC.
The JDBC coding is exactly the same as writing any Hive JDBC code, because the Thrift Server understands HiveQL and translates it to Spark SQL behind the scenes. Any Hive JDBC example will work as a reference.
For the username, you can give any Linux username on the Thrift Server machine and leave the password blank (this is non-secured mode; you can make things more secure if needed).
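For illustration, here is a minimal sketch of such a JDBC client (mine, not the answerer's), assuming the host/port from the start command above and a placeholder username:
import java.sql.DriverManager

// Standard Hive JDBC driver (must be on the classpath); the Spark Thrift Server speaks the same protocol.
Class.forName("org.apache.hive.jdbc.HiveDriver")

// "some-linux-user" is a placeholder username; the password is blank in non-secured mode.
val conn = DriverManager.getConnection("jdbc:hive2://spark-master:10001", "some-linux-user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM tweets")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()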
the table "tweets" will not be visible with "show tables" etc because the table is not registered in the hive metastore but you can do all sql queries with the table over jdbc
you can use Spark to cache data in memory and query that over remote jdbc/odbc. This will yield low latency responses, I got <2 secs responses for data size of about 40 GB. But, the caching will not work for really big data, like tera bytes. In that case, avoid the "cache table ..." command and you can still use the remote jdbc/odbc, but responses will not be in the low latency range
All the best!
MK
Upvotes: 17
Reputation: 27455
From SPARK-3675:
A common question on the mailing list is how to read from temporary tables over JDBC. While we should try and support most of this in SQL, it would also be nice to query generic RDDs over JDBC.
And the solution (coming in Spark 1.2.0) is indeed to use HiveThriftServer2.startWithContext.
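A minimal sketch of what that looks like (not from the ticket; it assumes Spark 1.2+ with the spark-hive-thriftserver module on the classpath, and the JSON path is just a placeholder):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("thrift-with-temp-tables"))
val sqlContext = new HiveContext(sc)

// `people` stands in for the data set from the question.
val people = sqlContext.jsonFile("examples/src/main/resources/people.json")
people.registerTempTable("people")

// Expose this same context (and its temp tables) over JDBC on the Thrift Server port.
HiveThriftServer2.startWithContext(sqlContext)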
Upvotes: 3