Reputation: 1358
I've created an EMR cluster that uses the Glue Data Catalog. When I invoke spark-shell, I can successfully list the tables stored in a Glue database via
spark.catalog.setCurrentDatabase("test")
spark.catalog.listTables
However, when I submit a job via spark-submit, I get a fatal error:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Database 'test' does not exist.;
I am creating my SparkSession within the job being submitted via spark-submit, using:
SparkSession.builder.enableHiveSupport.getOrCreate
Upvotes: 6
Views: 20975
Reputation: 61
You should check the option "Use Glue data catalog as the Hive metastore" in the Glue job settings. This is essential: otherwise Spark won't see the Glue catalog and will only see the "default" database created by Glue.
Upvotes: 1
Reputation: 347
My problem ended up being that another classification configuration had been interfering with the spark-hive-site one. I deleted all the others, and it was finally able to connect.
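For reference, here is a minimal sketch of an EMR configuration that contains only the spark-hive-site classification, using the property the AWS documentation describes for the Glue Data Catalog. Which other classifications conflict will depend on your cluster, so treat this as a starting point to compare against rather than a guaranteed fix.

```json
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```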
Upvotes: 0
Reputation: 191
Adding the hive.metastore.client.factory.class configuration to the code that initiates the Spark session solved the issue for me:
SparkSession spark = SparkSession.builder()
...
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
.enableHiveSupport()
.getOrCreate();
That's the same configuration defined in the AWS docs (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html) and added to the cluster configuration when checking "Use for Hive table metadata" on cluster creation, but for some reason it doesn't work as expected (I'm using EMR 5.12.0).
Upvotes: 17
Reputation: 1358
Our issue was IAM permissions on the EMR cluster; make sure that the cluster's IAM instance profile has full access to Glue.
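As a rough illustration of what "full access to Glue" could look like, a policy statement attached to the role behind the instance profile might resemble the sketch below. This is an assumption about the policy's shape, not necessarily the exact policy used here, and a narrower list of glue: actions and resources is preferable once things work.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:*",
      "Resource": "*"
    }
  ]
}
```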
Upvotes: 1
Reputation: 59
I had the same issue: spark-submit will not discover the AWS Glue libraries, but spark-shell running on the master node will.
It turns out that my spark-submit job uses a fat .jar which was compiled with the standard org.apache.spark and org.apache.hive libraries. The libraries bundled in the jar were being used instead of the custom classes installed on EMR.
If this is the case for you, make sure to exclude all 'org.apache.spark:', 'org.apache.hive:', and 'org.apache.hadoop:' modules from your .jar.
Here is the reference I used for Gradle: http://unethicalblogger.com/2015/07/15/gradle-goodness-excluding-depends-from-shadow.html. Adding the compileOnly keyword in front of all Spark libraries fixed it.
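A minimal sketch of what the compileOnly change might look like in a build.gradle dependencies block. The artifact names and versions below are assumptions for illustration (Scala 2.11 and Spark 2.2.1); match them to the Spark and Scala versions shipped with your EMR release. Because compileOnly dependencies are available at compile time but not part of the runtime classpath, they are left out of the fat jar, so the classes installed on EMR are used at run time.

```groovy
dependencies {
    // Assumed coordinates and versions; adjust to your EMR release.
    // compileOnly keeps these out of the packaged fat jar.
    compileOnly 'org.apache.spark:spark-core_2.11:2.2.1'
    compileOnly 'org.apache.spark:spark-sql_2.11:2.2.1'
    compileOnly 'org.apache.spark:spark-hive_2.11:2.2.1'
}
```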
Upvotes: 5
Reputation: 1592
EMR 5.9.0 has just been released - please give it a shot, it should work for you.
Relevant documentation:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Upvotes: -2