jamiet

Reputation: 12364

Hive query returns wrong answer when executed through Spark

Can anyone hazard a guess as to why the same query returns different results when executed in Hive and through Spark's DataFrame API? (The answer returned by Hive is the correct one, by the way.)
Hive:

gb-slo-svb-0019:10000 > select count(*) from sseft.feat_promo_prod_store_period;
INFO - > select count(*) from sseft.feat_promo_prod_store_period
_c0
84071294

Spark:

sqlContext.sql('select count(*) from sseft.feat_promo_prod_store_period').show()
+---+
|_c0|
+---+
|  0|
+---+

Interestingly, if I use Spark to read the underlying HDFS location directly rather than going through the Hive table, I get the correct answer:

sqlContext.read.parquet('/Lev4/sse/hive/sseft/feat_promo_prod_store_period').count()
84071294

(The original post included a screenshot showing all three results.)

Upvotes: 1

Views: 1890

Answers (2)

Mariusz

Reputation: 13946

The easiest way to determine what causes that behaviour is to look at explain() results. Compare these:

sqlContext.sql('select * from sseft.feat_promo_prod_store_period').explain()
sqlContext.read.parquet('/Lev4/sse/hive/sseft/feat_promo_prod_store_period').explain()

If they are not the same, look at how the table was created, for example: sqlContext.sql('show create table sseft.feat_promo_prod_store_period').first()
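To run both checks side by side, something like the following might help. It is only a sketch: the helper name `diagnose` is made up for illustration, and it assumes you pass in an existing `sqlContext` (as in the question) plus the table name and the HDFS path.

```python
def diagnose(sql_context, table, path):
    """Print both query plans and return the row count seen through each route.

    Hypothetical helper: `sql_context` is assumed to be an existing
    SQLContext/HiveContext, `table` a Hive table name, `path` its Parquet location.
    """
    via_hive = sql_context.sql('select * from {}'.format(table))
    via_files = sql_context.read.parquet(path)

    # explain() prints the physical plan to stdout and returns None,
    # so the two plans have to be compared by eye.
    via_hive.explain()
    via_files.explain()

    return {'hive_table': via_hive.count(), 'parquet_path': via_files.count()}
```

For the table in the question it would be called as diagnose(sqlContext, 'sseft.feat_promo_prod_store_period', '/Lev4/sse/hive/sseft/feat_promo_prod_store_period'). If the two counts differ, the plans printed above them should show where the two reads diverge.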

Upvotes: 1

Arvind Kumar

Reputation: 1335

Check your hive-site.xml. It should be copied to the Spark conf directory and contain the following configuration:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://host.xxx.com:9083</value>
  </property>
</configuration>
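If you want to confirm what the copied file actually contains, a small standard-library script can read the property back. This is a rough sketch: the function name `metastore_uris` is made up, and the file path you pass in is whatever location your Spark conf directory has (often $SPARK_HOME/conf/hive-site.xml, but that is an assumption about your install).

```python
import xml.etree.ElementTree as ET


def metastore_uris(hive_site_path):
    """Return the hive.metastore.uris value from a hive-site.xml file, or None."""
    root = ET.parse(hive_site_path).getroot()
    for prop in root.findall('property'):
        # findtext returns None when the child element is missing.
        name = (prop.findtext('name') or '').strip()
        if name == 'hive.metastore.uris':
            return (prop.findtext('value') or '').strip()
    return None
```

A result of None means the property is not set in that file, which would be worth fixing before looking any further.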

Upvotes: 0
