Reputation: 2385
I'm running a Spark application on EMR through the pyspark
interactive shell.
I'm trying to query a Hive table named content_publisher_events_log, which I know isn't empty (I ran exactly the same query through my Hue console), yet when I try to read it through pyspark I get count=0, as follows:
from pyspark.sql import HiveContext
Query=""" select dt
from default.content_publisher_events_log
where dt between '20170415' and '20170419'
"""
hive_context = HiveContext(sc)
user_data = hive_context.sql(Query)
user_data.count()
0 #that's the result
Also, from the console I can see that this table exists:
>>> sqlContext.sql("show tables").show()
+--------+--------------------+-----------+
|database| tableName|isTemporary|
+--------+--------------------+-----------+
| default|content_publisher...| false|
| default| feed_installer_log| false|
| default|keyword_based_ads...| false|
| default|search_providers_log| false|
+--------+--------------------+-----------+
>>> user_data.printSchema()
root
|-- dt: string (nullable = true)
I also checked on the Spark history server: the job that ran the count seems to have completed without any errors. Any idea what could be going wrong?
Thanks in advance!
Upvotes: 0
Views: 412
Reputation: 2385
It turns out our data team had moved the parquet file for each partition into a subfolder. They fixed it, and starting from April 25th it works perfectly.
As far as I know, if anyone is facing this issue, try something like this:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
or this one:
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive","true")
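Putting it together, a minimal sketch assuming the same pyspark shell session as in the question (sc already defined, table name and date range taken from there):
from pyspark.sql import HiveContext

hive_context = HiveContext(sc)  # sc comes from the pyspark shell

# Tell Hive/MapReduce input formats to descend into partition subdirectories
hive_context.sql("SET hive.mapred.supports.subdirectories=true")
hive_context.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

user_data = hive_context.sql("""
    select dt
    from default.content_publisher_events_log
    where dt between '20170415' and '20170419'
""")
user_data.count()  # should now be non-zero if the subfolders were the problem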
Upvotes: 0
Reputation: 3619
The dt column isn't in a datetime format. Either change the column itself to a datetime type, or rewrite the query to cast the strings to timestamps on both sides of the comparison:
Query=""" select dt
from default.content_publisher_events_log
where dt between
unix_timestamp('20170415','yyyyMMdd') and
unix_timestamp('20170419','yyyyMMdd')
"""
Upvotes: 1