Ftagn

Reputation: 315

Reading multiple files in a minio bucket with spark

I'm trying to read multiple files with Spark. The files are Avro files stored in a Minio bucket named datalake.

I'm using:

Spark 2.2.1, compiled without Hadoop

Minio (the latest minio/minio Docker image)

two packages: com.databricks:spark-avro_2.11:4.0.0 and org.apache.hadoop:hadoop-aws:2.8.3

I'm currently testing with pyspark:

PYSPARK_PYTHON=python3 /usr/local/spark/pyspark --packages com.databricks:spark-avro_2.11:4.0.0,org.apache.hadoop:hadoop-aws:2.8.3

Initializing the connection with Minio:

# Minio credentials (the local access/secret key pair)
AWS_ID='localKey'
AWS_KEY='localSecret'
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ID)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_KEY)
# Point the s3a connector at the local Minio server instead of AWS
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "http://127.0.0.1:9000")
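
Minio generally expects path-style requests, while the s3a connector may default to virtual-hosted-style bucket addressing. Enabling path-style access is worth checking as well (a sketch against the same SparkContext; whether it is strictly needed depends on the Hadoop version):

# Possibly needed for Minio: address the bucket as http://host:9000/bucket/key
# (path style) rather than http://bucket.host:9000/key (virtual-hosted style).
sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")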

When the files are stored directly in the bucket root, I can use wildcards like this:

DF_RANDOM = spark.read.format("com.databricks.spark.avro").load("s3a://datalake/random-random_table+0+000000001*.avro")

The result is OK:

DF_RANDOM.show()
+-----+-------------------+---+-------------+
|index|                  A|  B|    timestamp|
+-----+-------------------+---+-------------+
|   12| 0.5680445610939323|  1|1530017325000|
|   13|  0.925596638292661|  5|1530017325000|
|   14|0.07103605819788694|  4|1530017325000|
|   15|0.08712929970154071|  7|1530017325000|
+-----+-------------------+---+-------------+

However, if the files are stored in a subfolder:

DF_RANDOM = spark.read.format("com.databricks.spark.avro").load("s3a://datalake/random/random-random_table+0+000000001*.avro")

an error occurs:

Py4JJavaError: An error occurred while calling o111.load. : java.nio.file.AccessDeniedException: s3a://datalake/random: getFileStatus on s3a://datalake/random: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: null), S3 Extended Request ID: null

I don't understand why. The subfolders are created by a Kafka connector.
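
In S3 semantics a "subfolder" is only a key prefix, so one way to see exactly what the connector wrote is to list that prefix directly against Minio, outside of Spark (a sketch assuming boto3 is installed; endpoint and credentials as above):

# List the raw keys under the random/ prefix straight from Minio,
# bypassing the s3a connector entirely.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="localKey",
    aws_secret_access_key="localSecret",
)
resp = s3.list_objects_v2(Bucket="datalake", Prefix="random/")
for obj in resp.get("Contents", []):
    print(obj["Key"])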

Note that if I don't use wildcards, I can access files stored in these subfolders, like this:

DF_RANDOM = spark.read.format("com.databricks.spark.avro").load("s3a://datalake/random/random-random_table+0+0000000012.avro")

Is there any policy or access right to set? The spark.read instruction seems to treat s3a://datalake/random as a file, but it's a folder to browse.
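
Expanding a wildcard forces Spark to list the parent path, whereas loading a single key only needs a direct lookup, which would explain why only the glob fails. The listing can be reproduced from the same pyspark session through the JVM gateway (a sketch; if this raises the same 403, the LIST operation is what Minio is rejecting):

# Reproduce the directory listing that glob expansion performs.
hadoop_conf = sc._jsc.hadoopConfiguration()
path = sc._jvm.org.apache.hadoop.fs.Path("s3a://datalake/random/")
fs = path.getFileSystem(hadoop_conf)
for status in fs.listStatus(path):
    print(status.getPath())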

Any idea?

Thanks anyway

Upvotes: 1

Views: 3022

Answers (1)

Ftagn

Reputation: 315

It was a Minio issue.

Fixed in release 2018-06-26T17:56:31Z:

https://github.com/minio/minio/pull/5966

Upvotes: 0
