Vincent Claes

Reputation: 4768

Reading data from s3 subdirectories in PySpark

I want to read all parquet files from an S3 bucket, including all those in the subdirectories (these are actually prefixes).

Using wildcards (*) in the S3 URL only works for files directly under the specified prefix. For example, this code will only read the parquet files that sit directly in the target/ folder:

df = spark.read.parquet("s3://bucket/target/*.parquet")
df.show()

Let's say I have a structure like this in my S3 bucket:

"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"

The above code will raise the exception:

pyspark.sql.utils.AnalysisException: 'Path does not exist: s3://mailswitch-extract-underwr-prod/target/*.parquet;'

How can I read all the parquet files from the subdirectories from my s3 bucket?

To run my code, I am using AWS Glue 2.0 with Spark 2.4 and Python 3.

Upvotes: 3

Views: 17564

Answers (3)

Andrea Nerla

Reputation: 61

For those like me searching for an answer to "How can I read all the files in my S3 bucket using PySpark?", the answer (following the OP's example) is simply:

df = spark.read.parquet("s3://bucket/target/")
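If you are on Spark 3.0 or later (the question uses Glue 2.0 with Spark 2.4, where this is not available), there is also a recursiveFileLookup reader option; a minimal sketch, assuming the same bucket layout as in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 3.0+: recursively pick up parquet files at any depth below target/.
# Note that this option disables partition discovery.
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .parquet("s3://bucket/target/")
)
df.show()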

Upvotes: 1

Vincent Claes

Reputation: 4768

If you want to read all parquet files below the target folder

"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"

you can do

df = spark.read.parquet("bucket/target/*/*/*/*.parquet")

The downside is that you need to know the depth of your parquet files.
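For the year/month/day layout above, each * matches one directory level. If you also need the date that is encoded in the prefix, one option (a sketch, not part of the original answer; bucket and prefix names are the placeholders from the question) is to recover it from each row's source file path:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One wildcard per directory level: year / month / day.
df = spark.read.parquet("s3://bucket/target/*/*/*/*.parquet")

# The year/month/day prefixes are not columns in the data, but they can be
# pulled out of the input file path of each row.
df = df.withColumn("source_path", F.input_file_name())
df = df.withColumn(
    "date_prefix",
    F.regexp_extract("source_path", r"target/(\d{4}/\d{2}/\d{2})/", 1),
)
df.show()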

Upvotes: 5

This worked for me:

df = spark.read.parquet("s3://your/path/here/some*wildcard")

Upvotes: 0
