Daebarkee

Reputation: 683

How to serialize/deserialize case classes from a spark dataset to/from s3

Let's say I have a Dataset[MyData] where MyData is defined as: case class MyData(id: String, listA: List[SomeOtherCaseClass])

I want to save the data to S3 and load it back later as a Dataset[MyData]. I know case class data is serializable. But is it possible to do something like:

myData.write.xxxx("s3://someBucket/some")

// later
val myloadedData: Dataset[MyData] = spark.read.yyyy("s3://someBucket/some", MyData)

Upvotes: 0

Views: 462

Answers (1)

What does serialization mean to you?

Because all you need to do is exactly what you showed, choosing whichever built-in format you like, e.g. csv, json, parquet, orc, ...
(I would recommend benchmarking ORC against Parquet on your data to see which one works better for you.)

myData.write.orc("s3://someBucket/somePath")

And when reading, just use the same format to get a DataFrame back, which you can convert to a Dataset[MyData] with the as[T] method. Note that as[MyData] needs an implicit Encoder[MyData] in scope, which import spark.implicits._ provides automatically for case classes.

val myloadedData: Dataset[MyData] = spark.read.orc("s3://someBucket/somePath").as[MyData]
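
For completeness, here is a minimal end-to-end sketch of the round trip. The fields of SomeOtherCaseClass are hypothetical stand-ins, since its definition isn't shown in the question:

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical fields; substitute your real definition.
case class SomeOtherCaseClass(name: String, value: Int)
case class MyData(id: String, listA: List[SomeOtherCaseClass])

val spark = SparkSession.builder()
  .appName("MyDataRoundTrip")
  .getOrCreate()

// Brings the implicit Encoders that toDS() and as[MyData] need into scope.
import spark.implicits._

val myData: Dataset[MyData] = Seq(
  MyData("a", List(SomeOtherCaseClass("x", 1))),
  MyData("b", Nil)
).toDS()

// Write: the case class fields become ORC columns
// (listA is stored as an array-of-struct column).
myData.write.orc("s3://someBucket/somePath")

// Read: you get a DataFrame back; as[MyData] re-attaches the type.
val myLoadedData: Dataset[MyData] =
  spark.read.orc("s3://someBucket/somePath").as[MyData]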

Or was your question how to connect to S3? If so: when you are running on EMR, everything is already set up for you. You only need to prefix your path with s3://, as you already did.
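
If you are running outside EMR, one common approach is the Hadoop S3A connector. This is only a hedged sketch: it assumes the hadoop-aws jar (with a matching AWS SDK) is on your classpath, reads the credentials from environment variables, and uses the s3a:// scheme instead of s3://:

// Pass AWS credentials to the S3A filesystem via the Hadoop configuration.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// Note the s3a:// scheme, which the hadoop-aws connector handles.
myData.write.orc("s3a://someBucket/somePath")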

Upvotes: 1
