Annie

Reputation: 165

Spark Error reading csv file with spaces in the path/file name

I want to read a CSV file with Spark. The file's path contains blank spaces, and Spark is replacing them with %20.

This is the code:

val tmpDF = spark.read
  .format("com.databricks.spark.csv")
  .option("multiLine", value = true)
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", delimiter)
  .load(filename)

tmpDF.show(10)

When tmpDF.show(10) is executed, the following error is thrown:

java.io.FileNotFoundException: No such file or directory: s3://{bucket_name}/all/Proposal%20and%20pre-approval/filen_name_20190826-215950.csv 

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running REFRESH TABLE tableName command in SQL or by recreating the Dataset/DataFrame involved.

I checked in S3 and the file does exist, but its path has a regular space instead of %20.

Any idea how to handle this? I can't change the paths because they are produced by a component that I can't modify.

Upvotes: 4

Views: 4866

Answers (1)

Avishek Bhattacharya

Reputation: 6974

This is a typical URL-encoding problem. The URL coming from S3 is encoded with %20, but Spark decodes it incorrectly.

There have been two Spark issues about this:

  1. https://jira.apache.org/jira/browse/SPARK-23148
  2. https://jira.apache.org/jira/browse/SPARK-24320

These issues have been resolved in Spark 2.3. If you are using an older version, you need to escape the file names yourself after decoding the URL.
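
As a rough illustration of the decode and escape steps, here is a minimal sketch that uses only java.net.URI from the JDK. The object name PathEscaping, the helper names decodePath and encodePath, and the bucket name my-bucket are hypothetical and not part of Spark or of the original question. Which of the two forms your Spark version expects in load(...) is something you would need to try out; this only shows how to convert between them.

```scala
import java.net.URI

object PathEscaping {
  // Turn percent-escapes such as %20 back into literal characters.
  // Assumes the input is a syntactically valid URI of the form scheme://bucket/path.
  def decodePath(encoded: String): String = {
    val uri = new URI(encoded)
    // getPath returns the decoded path; getRawPath would keep the escapes.
    s"${uri.getScheme}://${uri.getAuthority}${uri.getPath}"
  }

  // Percent-encode characters that are illegal in a URI path (for example spaces).
  // rawPath must start with "/" because a scheme is given.
  def encodePath(scheme: String, bucket: String, rawPath: String): String =
    new URI(scheme, bucket, rawPath, null).toASCIIString
}
```

For example, PathEscaping.decodePath("s3://my-bucket/all/Proposal%20and%20pre-approval/f.csv") gives the path with regular spaces, and PathEscaping.encodePath("s3", "my-bucket", "/all/Proposal and pre-approval/f.csv") gives it back with %20. java.net.URI is used rather than java.net.URLDecoder because URLDecoder follows form-encoding rules and would also turn a literal + into a space.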

Upvotes: 3
