Pitt

Reputation: 3

Spark cannot query Hive tables it can see?

I'm running the prebuilt version of Spark 1.2 for CDH 4 on CentOS. I have copied the hive-site.xml file into the conf directory in Spark so it should see the Hive metastore.

I have three tables in Hive (facility, newpercentile, percentile), all of which I can query from the Hive CLI. After I log into Spark and create the Hive context with val hiveC = new org.apache.spark.sql.hive.HiveContext(sc), I run into an issue querying these tables.

If I run the following command: val tableList = hiveC.hql("show tables") and do a collect() on tableList, I get this result: res0: Array[org.apache.spark.sql.Row] = Array([facility], [newpercentile], [percentile])

If I then run this command to get the count of the facility table: val facTable = hiveC.hql("select count(*) from facility"), I get the following output, which I take to mean that Spark cannot find the facility table:

scala> val facTable = hiveC.hql("select count(*) from facility")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/12/26 10:27:26 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

14/12/26 10:27:26 INFO ParseDriver: Parsing command: select count(*) from facility
14/12/26 10:27:26 INFO ParseDriver: Parse Completed
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(355177) called with curMem=0, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 346.9 KB, free 264.6 MB)
14/12/26 10:27:26 INFO MemoryStore: ensureFreeSpace(50689) called with curMem=355177, maxMem=277842493
14/12/26 10:27:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 49.5 KB, free 264.6 MB)
14/12/26 10:27:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:45305 (size: 49.5 KB, free: 264.9 MB)
14/12/26 10:27:26 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/12/26 10:27:26 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:68

facTable: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[2] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==

Aggregate false, [], [Coalesce(SUM(PartialCount#38L),0) AS _c0#5L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#38L]
   HiveTableScan [], (MetastoreRelation default, facility, None), None

Any assistance would be appreciated. Thanks.

Upvotes: 0

Views: 2953

Answers (1)

Mike Park

Reputation: 10941

scala> val facTable = hiveC.hql("select count(*) from facility")

Great! You have an RDD; now what do you want to do with it?

scala> facTable.collect()

Remember that an RDD is an abstraction on top of your data and is not materialized until you invoke an action on it such as collect() or count().

You would get a very obvious error if you tried to use a non-existent table name.
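The laziness described above is not Spark-specific, so here is a minimal plain-Scala sketch of the same idea (no cluster required, and the names are illustrative): a view, like an RDD, records the computation but defers it until you force it with an "action", analogous to hql(...) versus collect().

```scala
// Plain-Scala analogy for RDD laziness: defining the pipeline runs nothing;
// forcing it with toList (the "action") does all the work at once.
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // Like hiveC.hql(...): this only describes the computation.
    val lazySeq = (1 to 5).view.map { x => evaluated += 1; x * 2 }
    assert(evaluated == 0)        // nothing has run yet

    // Like facTable.collect(): forcing the view executes the mapped function.
    val result = lazySeq.toList
    assert(evaluated == 5)
    assert(result == List(2, 4, 6, 8, 10))
    println(result.mkString(", "))
  }
}
```

This is why the original hql call "succeeded" despite printing only a query plan: no table was actually read until an action like collect() or count() was invoked.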

Upvotes: 3
