Yehoshaphat Schellekens

Reputation: 2385

sqlContext.sql returns table with count 0, pyspark

I'm running a Spark application on EMR through the pyspark interactive shell.

I'm trying to connect to a Hive table named content_publisher_events_log, which I know isn't empty (I ran exactly the same query through my Hue console), yet when I read it through pyspark I get count=0, as follows:

from pyspark.sql import HiveContext
Query = """
select dt
from default.content_publisher_events_log
where dt between '20170415' and '20170419'
"""
hive_context = HiveContext(sc)
user_data = hive_context.sql(Query)
user_data.count()
0  # that's the result

Also, from the console I can see that the table exists:

>>> sqlContext.sql("show tables").show()
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|content_publisher...|      false|
| default|  feed_installer_log|      false|
| default|keyword_based_ads...|      false|
| default|search_providers_log|      false|
+--------+--------------------+-----------+

>>> user_data.printSchema()
root
 |-- dt: string (nullable = true)
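
For completeness, the partitions Hive has registered for the table can be listed from the same shell (a quick diagnostic when a filter on a partition column returns nothing):

>>> sqlContext.sql("show partitions default.content_publisher_events_log").show()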

I also checked the Spark history server; the job that ran the count seems to have completed without any errors. Any idea what could go wrong?

Thanks in advance!

Upvotes: 0

Views: 412

Answers (2)

Yehoshaphat Schellekens

Reputation: 2385

It turns out our data team had moved each partition's Parquet file into a subfolder; they fixed it, and starting from April 25th it works perfectly.

As far as I know, if anyone else is facing this issue, try something like this:

sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

or this one:

sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive","true") 
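
Put together, a minimal end-to-end sketch (the table name and date range are taken from the question; the two configuration keys are the ones shown above):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# Allow Hive and the MapReduce input format to descend into partition subdirectories
hive_context.sql("SET hive.mapred.supports.subdirectories=true")
hive_context.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

user_data = hive_context.sql("""
select dt
from default.content_publisher_events_log
where dt between '20170415' and '20170419'
""")
print(user_data.count())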

Upvotes: 0

Pushkr

Reputation: 3619

The dt column isn't in datetime format. Either change the column itself to a datetime format, or change the query to cast the string to a timestamp (applying unix_timestamp to both sides keeps the comparison numeric):

Query=""" select dt
from default.content_publisher_events_log 
where  dt  between  
unix_timestamp('20170415','yyyyMMdd')  and 
unix_timestamp('20170419','yyyyMMdd') 
"""

Upvotes: 1
