Reputation: 12364
Would anyone care to hazard a guess as to why the same query returns different results when executed in Hive and through Spark's DataFrame API? (The result returned by Hive is the correct one, by the way.)
Hive:
gb-slo-svb-0019:10000 > select count(*) from sseft.feat_promo_prod_store_period;
INFO - > select count(*) from sseft.feat_promo_prod_store_period
_c0
84071294
Spark:
sqlContext.sql('select count(*) from sseft.feat_promo_prod_store_period').show()
+---+
|_c0|
+---+
| 0|
+---+
Interestingly, if I point Spark at the underlying HDFS location rather than at the Hive table, I get the correct answer:
sqlContext.read.parquet('/Lev4/sse/hive/sseft/feat_promo_prod_store_period').count()
84071294
This image depicts all three:
Upvotes: 1
Views: 1890
Reputation: 13946
The easiest way to determine what causes that behaviour is to look at the explain() results. Compare these:
sqlContext.sql('select * from sseft.feat_promo_prod_store_period').explain()
sqlContext.read.parquet('/Lev4/sse/hive/sseft/feat_promo_prod_store_period').explain()
If they are not the same, you should look at how the table was created, for example: sqlContext.sql('show create table sseft.feat_promo_prod_store_period').first()
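Since explain() prints the plan instead of returning it, comparing two plans by eye can be tedious. Here is a minimal sketch for diffing them, assuming that explain() writes to Python's standard output (which is how PySpark behaves as far as I know). The helper names capture_explain and diff_plans are made up for this example and are not part of Spark.

```python
import contextlib
import difflib
import io


def capture_explain(df):
    """Return the text that df.explain() prints (hypothetical helper)."""
    buf = io.StringIO()
    # Assumption: explain() prints from the Python side, so redirecting
    # sys.stdout is enough to capture the plan text.
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def diff_plans(plan_a, plan_b):
    """Return unified-diff lines for two plan strings (empty list if equal)."""
    return list(difflib.unified_diff(
        plan_a.splitlines(),
        plan_b.splitlines(),
        fromfile="hive_table",
        tofile="parquet_path",
        lineterm="",
    ))


# Usage with the queries from this thread (needs a live sqlContext):
# hive_plan = capture_explain(
#     sqlContext.sql('select * from sseft.feat_promo_prod_store_period'))
# path_plan = capture_explain(
#     sqlContext.read.parquet('/Lev4/sse/hive/sseft/feat_promo_prod_store_period'))
# print('\n'.join(diff_plans(hive_plan, path_plan)))
```

Any lines starting with - or + in the output show where the two plans diverge, for instance if the Hive table resolves to a different location or file format than the Parquet path.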
Upvotes: 1
Reputation: 1335
Check your hive-site.xml. It should be copied to the Spark conf directory and should contain the following configuration:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://host.xxx.com:9083</value>
  </property>
</configuration>
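To confirm which metastore the copy in the Spark conf directory actually points at, something like the following might help. It is only a sketch using the Python standard library; the function name metastore_uris is invented for this example, and the file path in the usage comment is an assumption about where your Spark conf directory lives.

```python
import xml.etree.ElementTree as ET


def metastore_uris(hive_site_xml):
    """Return hive.metastore.uris from hive-site.xml contents, or None."""
    root = ET.fromstring(hive_site_xml)
    # Each setting is a <property> element with <name> and <value> children.
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "hive.metastore.uris":
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


# Usage (the path is an assumption, adjust it to your installation):
# with open("/path/to/spark/conf/hive-site.xml", "rb") as f:
#     print(metastore_uris(f.read()))
```

If this returns None, or a URI different from the one the Hive client uses, Spark may be talking to a different metastore than Hive is.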
Upvotes: 0