Reputation: 347
I'm testing some PySpark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata integrated with the AWS Glue catalog so that I can read and write to them through Spark.
The first part of the code reads some data from S3/Glue, does some transformations and whatnot, then writes the resulting dataframe to S3/Glue like so:
df.repartition('datekey', 'coeff') \
    .write \
    .format('parquet') \
    .partitionBy('datekey', 'coeff') \
    .mode('overwrite') \
    .option('path', S3_PATH) \
    .saveAsTable('hive_tables.my_table')
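The table itself does register in the catalog (unfiltered queries below work fine); a quick sanity check from the notebook, using standard Spark catalog APIs, looks like this:
# Confirm the table and its schema landed in the Glue-backed catalog.
print(spark.catalog.listTables('hive_tables'))
spark.sql('show create table hive_tables.my_table').show(truncate=False)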
I then try to access this table with Spark SQL, but when I run something as simple as
spark.sql('select * from hive_tables.my_table where datekey=20210506').show()
it throws this:
An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
I've learned this happens only when filtering on the datekey partition column. For example, both of the following commands work fine:
spark.sql('select * from hive_tables.my_table where coeff=0.5').show()
and
spark.sql('select * from hive_tables.my_table').show()
I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena, just not Spark SQL.
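For reference, checking the partitions from the notebook can be done like this (SHOW PARTITIONS is standard Spark SQL for partitioned tables):
# List every partition the catalog knows about for the table,
# e.g. rows like 'datekey=20210506/coeff=0.5'.
spark.sql('show partitions hive_tables.my_table').show(truncate=False)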
Also Glue definitely has the two partition columns recognized:
datekey: int
coeff: double
Any ideas here? I've tried everything I can think of and it just isn't making any sense.
Upvotes: 1
Views: 4658
Reputation: 364
I had a similar issue in a similar environment (EMR cluster + Spark SQL + AWS Glue catalog). The query was like this:
select *
from ufd.core_agg_data
where year <> date_format(current_timestamp, 'yyyy')
This is a table partitioned by "year", and "year" is a string. Note that "year" is used in the filter.
I got
User class threw exception: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown operator '!='
Then I "modified" the query to this one, and it worked!
select *
from ufd.core_agg_data
where year in (select date_format(current_timestamp, 'yyyy'))
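If rewriting the filter isn't an option, one other lever to try, as a sketch under the assumption that your EMR build honors it at runtime, is Spark's spark.sql.hive.metastorePartitionPruning setting, which controls whether partition predicates are pushed to the metastore at all. Disabling it makes Spark list every partition and prune client-side, so it trades the Glue error for a slower full partition listing:
# Stop pushing partition filter expressions to the (Glue-backed)
# metastore; Spark fetches all partitions and prunes them itself.
# On some versions this may need to be set at session startup.
spark.conf.set('spark.sql.hive.metastorePartitionPruning', 'false')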
Upvotes: 1
Reputation: 64
I had the same error in EMR 6.3.0 (Spark 3.1.1).
After upgrading to EMR 6.5.0 (Spark 3.1.2), it was resolved.
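To confirm which Spark build the notebook kernel is actually attached to before and after the upgrade, spark.version is the standard check:
# Print the running Spark version; EMR 6.5.0 should report 3.1.2.
print(spark.version)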
Upvotes: 1
Reputation: 347
I would still like a straightforward solution to this, but currently this workaround suffices:
I first read the table straight from the S3 path:
temp_df = spark.read.parquet(S3_PATH)
so that it doesn't go through the Glue catalog for metadata. Then I register a global temp view:
temp_df.createGlobalTempView('my_table')
which lets me query it with Spark SQL through the global_temp database:
spark.sql('select * from global_temp.my_table where datekey=20210506').show()
and this works.
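If you don't need the view to be visible beyond the current session, a session-scoped variant works the same way (createOrReplaceTempView is standard PySpark; the view name here is just an example):
# Session-scoped alternative: visible only to this SparkSession and
# queried without the global_temp qualifier.
temp_df.createOrReplaceTempView('my_table_tmp')
spark.sql('select * from my_table_tmp where datekey=20210506').show()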
Upvotes: 0