iyerland

Reputation: 642

AWS EMR Spark: Error writing to S3 - IllegalArgumentException - Cannot create a path from an empty string

I have been trying to fix this for a long time now and have no idea why it happens. FYI, I'm running Spark on an AWS EMR cluster. I debugged and can clearly see the destination path being provided, something like s3://my-bucket-name/. The Spark job creates ORC files and writes them after creating a partition, e.g. date=2017-06-10. Any ideas?

17/07/08 22:48:31 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:134)
    at org.apache.hadoop.fs.Path.<init>(Path.java:93)
    at org.apache.hadoop.fs.Path.suffix(Path.java:361)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:138)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:82)

code that writes orc:

dataframe.write
  .partitionBy(partition)
  .option("compression", ZLIB.toString)
  .mode(SaveMode.Overwrite)
  .orc(destination)

Upvotes: 3

Views: 7327

Answers (1)

asmaier

Reputation: 11746

I have seen a similar problem when writing Parquet files to S3. The problem is SaveMode.Overwrite, which does not seem to work correctly in combination with S3. Try deleting all the data in your S3 bucket my-bucket-name before writing into it; your code should then run successfully.

To delete all files from your bucket my-bucket-name, you can use the following PySpark code:

# see https://www.quora.com/How-do-you-overwrite-the-output-directory-when-using-PySpark
# Access the Hadoop classes on the JVM through the py4j gateway of the SparkContext
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# see http://crazyslate.com/how-to-rename-hadoop-files-using-wildcards-while-patterns/
# Get a FileSystem handle for the bucket and delete everything matching "/*" at its root
fs = FileSystem.get(URI("s3a://my-bucket-name"), sc._jsc.hadoopConfiguration())
file_status = fs.globStatus(Path("/*"))
for status in file_status:
    fs.delete(status.getPath(), True)  # True = delete recursively
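The same glob-and-delete pattern can be sketched without Spark or S3 using only Python's standard library, with a temporary local directory standing in for the bucket (the directory layout and file names below are made up for illustration):

```python
import shutil
import tempfile
from pathlib import Path

# Create a scratch "bucket" containing one partition directory,
# mirroring the layout the Spark job produces (date=2017-06-10/...).
bucket = Path(tempfile.mkdtemp())
partition = bucket / "date=2017-06-10"
partition.mkdir()
(partition / "part-00000.orc").write_text("dummy")

# Glob every entry at the bucket root and delete it recursively --
# the same pattern as fs.globStatus(Path("/*")) followed by
# fs.delete(status.getPath(), True) above.
for entry in bucket.glob("*"):
    if entry.is_dir():
        shutil.rmtree(entry)
    else:
        entry.unlink()

print(sorted(p.name for p in bucket.iterdir()))  # -> []
```

After this runs the bucket stand-in is empty, which is the state the answer recommends before retrying the write with SaveMode.Overwrite.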

Upvotes: 4
