Reputation: 165
I want to read a CSV file using Spark. The file's path contains blank spaces, and Spark is replacing them with %20.
This is the code:
val tmpDF = spark.read
  .format("com.databricks.spark.csv")
  .option("multiLine", value = true)
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", delimiter)
  .load(filename)
tmpDF.show(10)
When tmpDF.show(10) is executed, the following error is thrown:
java.io.FileNotFoundException: No such file or directory: s3://{bucket_name}/all/Proposal%20and%20pre-approval/filen_name_20190826-215950.csv
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running REFRESH TABLE tableName command in SQL or by recreating the Dataset/DataFrame involved.
I checked in S3 and the file does exist, but its path has a regular space instead of %20.
Any idea how to handle this? I can't change the paths because they are produced by a component that I can't modify.
Upvotes: 4
Views: 4866
Reputation: 6974
This is a typical URL-encoding problem. The URL coming from S3 is encoded with %20, but Spark decodes it incorrectly.
There have been two issues filed regarding this.
Both were resolved in Spark 2.3. If you are using an older version, you need to escape the file names after decoding the URL.
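As a rough illustration of what decoding and then re-escaping a path can look like, here is a minimal sketch that uses only java.net.URI from the JDK. The object name S3PathSpaces, both helper names, and the bucket name my-bucket are hypothetical and chosen for the example. Which of the two forms an older Spark version needs in load(...) is not something this thread establishes, so treat it as something to try rather than a confirmed fix.

```scala
import java.net.URI

// Hypothetical helper object; not part of Spark or any library.
object S3PathSpaces {

  /** Percent-decodes the path part of a location, e.g. "%20" becomes " ". */
  def decode(location: String): String = {
    val uri = new URI(location)
    // getPath returns the decoded path; scheme and authority are kept as-is.
    s"${uri.getScheme}://${uri.getAuthority}${uri.getPath}"
  }

  /** Percent-encodes characters that are illegal in a URI path, e.g. " " becomes "%20". */
  def encode(scheme: String, bucket: String, path: String): String =
    // The multi-argument URI constructor quotes illegal characters for us.
    new URI(scheme, bucket, path, null, null).toASCIIString
}
```

For example, S3PathSpaces.decode could be applied to the path produced by the upstream component before handing it to load(...), and S3PathSpaces.encode used if the escaped form turns out to be the one that works on your Spark version.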
Upvotes: 3