Reputation: 23
I am having a hard time debugging why a simple query against a Hive external table (DynamoDB-backed) takes north of 10 minutes via spark-submit, while it takes only 4 seconds in the hive shell.
The Hive external table refers to a DynamoDB table, say Employee[id, name, ssn, dept], where id is the partition key and ssn is the range key.
Using AWS EMR 5.29 with Spark, Hive, Tez, and Hadoop; 1 master and 4 core nodes, m5.l.
In the hive shell: select name, dept, ssn from employee where id='123/ABC/X12I'
returns results in 4 seconds.
Now, let's say I have the following code in code.py:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")
# print data or get its length
I submit the above on the master node as:
spark-submit --jars /pathto/emr-ddb-hive.jar,/pathto/emr-ddb-hadoop.jar code.py
The above spark-submit takes a long time, 14+ minutes. I am not sure which parameter needs to be tweaked or set to get a better response time.
In the hive shell I ran SET; to view the parameters the hive shell is using, and there are a gazillion.
I also tried querying DynamoDB directly with boto3, and it is way faster than my simple PySpark SQL via spark-submit.
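For reference, a minimal sketch of the boto3 approach; the region and the exact table/attribute names (Employee, id, name, dept, ssn) are assumptions based on the question:

import boto3
from boto3.dynamodb.conditions import Key

# Assumed region and table name; adjust to your setup.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Employee")

# Query on the partition key only; DynamoDB returns every item sharing
# that id (one per range-key value, i.e. per ssn).
response = table.query(KeyConditionExpression=Key("id").eq("123/ABC/X12I"))
for item in response["Items"]:
    print(item["name"], item["dept"], item["ssn"])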
I am missing some fundamentals... Any idea or direction is appreciated.
Upvotes: 1
Views: 219
Reputation: 23
I was doing an aggregation when trying to print, by calling collect(). I had read about it but did not realize it was that bad (timing-wise). I ended up doing some more experiments with take(n) and limit(1) instead.
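A minimal sketch of the difference, assuming the same query as in the question: collect() pulls the entire result set back to the driver before anything prints, while take(n) and limit(n) let Spark stop as soon as it has enough rows.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = spark.sql("select name, dept, ssn from employee where id='123/ABC/X12I'")

# Forces all rows onto the driver before anything prints:
# rows = data.collect()

# Cheaper ways to inspect a few rows or check for emptiness:
print(data.take(5))           # at most 5 rows
print(data.limit(1).count())  # 0 or 1, without a full collect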
Upvotes: 0