Reputation: 2621
How can I load a bunch of files from an S3 bucket into a single PySpark dataframe? I'm running on an EMR instance. If the files are local, I can use the SparkContext textFile method. But when the files are on S3, how can I use boto3 to load multiple files of various types (CSV, JSON, ...) into a single dataframe for processing?
Upvotes: 4
Views: 19120
Reputation: 191748
Spark natively reads from S3 using Hadoop APIs, not Boto3. Also, textFile returns an RDD, not a DataFrame. And don't try to load two different formats into a single dataframe, as you won't be able to parse them consistently.
I would suggest using:
csvDf = spark.read.csv("s3a://path/to/files/*.csv")
jsonDf = spark.read.json("s3a://path/to/files/*.json")
And from there, you can filter and join the DataFrames using Spark SQL, as sketched below.
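For example, a minimal sketch assuming both sources share a common join key (the view names and the "id" column are hypothetical):

csvDf.createOrReplaceTempView("csv_data")
jsonDf.createOrReplaceTempView("json_data")
# join the two sources on the assumed shared "id" column
joinedDf = spark.sql(
    "SELECT c.*, j.* FROM csv_data c JOIN json_data j ON c.id = j.id"
)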
Note: by default, spark.read.json expects JSON Lines format, i.e. each line must contain a single, self-contained JSON object.
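If your files instead hold multi-line (pretty-printed) JSON documents, Spark can still read them with the multiLine option:

jsonDf = spark.read.json("s3a://path/to/files/*.json", multiLine=True)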
Upvotes: 8