Reputation: 533
I'm using Spark 2.4.0 on EMR and trying to store simple Dataframe in s3 using AWS Glue Data Catalog. The code is below:
val peopleTable = spark.sql("select * from emrdb.testtableemr")
val filtered = peopleTable.filter("name = 'Andrzej'")
filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.testtableemr")
Above code works as expected- data is filtered and stored in s3 directory that is linked with AWS Glue table emrdb.testtableemr. The issue I got is: although it works correctly it still throws below exception
scala> filtered.repartition(1).write.format("hive").mode("append").saveAsTable("emrdb.testtableemr")
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Can not create a Path from an empty string;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:843)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadTable(ExternalCatalogWithListener.scala:159)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:259)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
at org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:66)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:465)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:444)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
... 49 elided
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
at org.apache.hadoop.fs.Path.<init>(Path.java:175)
at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:62)
at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:534)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:497)
at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:485)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1669)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:878)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:780)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:780)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:779)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:845)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:843)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 74 more
I got same error using insertInto
method:
filtered.repartition(1).write.mode("append").insertInto("emrdb.testtableemr")
Can you please help me understand meaning of this Exception in this context and suggest way how to fix this?
Thanks in advance!
Regards Andrzej
Upvotes: 5
Views: 12631
Reputation: 988
I had a similar issue and my solution was to go to the settings of the database in Glue Data Catalog ('emrdb' in this case for you) and add a Location URI (it can be a random one). Then I can create tables without specifying this LOCATION as I was doing before without Glue Data Catalog and everything is working fine.
It is described in the official documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Having a default database without a location URI causes failures when you create a table. As a workaround, use the LOCATION clause to specify a bucket location, such as s3://mybucket, when you use CREATE TABLE. Alternatively create tables within a database other than the default database.
Upvotes: 2
Reputation: 5124
The issue is happening because it is missing s3 path in your dataframe writer statement. Passing s3 path as shown below will fix this issue.
val peopleTable = spark.sql("select * from emrdb.testtableemr")
val filtered = peopleTable.filter("name = 'Andrzej'")
filtered.repartition(1).write.option("path","s3://testbucket/testpath/").mode("append").saveAsTable("emrdb.testtableemr")
Upvotes: 8