Reputation: 683
Let's say I have a Dataset[MyData] where MyData is defined as:
case class MyData(id: String, listA: List[SomeOtherCaseClass])
I want to save the data to S3 and load it back later as a Dataset[MyData]. I know case class data is serializable, but is it possible to do something like:
myData.write.xxxx("s3://someBucket/some")
// later
val myloadedData: Dataset[MyData] = spark.read.yyyy("s3://someBucket/some", MyData)
Upvotes: 0
Views: 462
Reputation: 22850
What does serialization mean to you?
Because you only need to do exactly what you showed, choosing whichever available format you like, e.g. csv, json, parquet, orc, ... (I would recommend benchmarking ORC against Parquet on your data to see which one works better for you).
myData.write.orc("s3://someBucket/somePath")
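For context, here is a minimal end-to-end write sketch. The fields of SomeOtherCaseClass and the sample values are hypothetical (the question does not show them); the bucket path is taken from the question:

import org.apache.spark.sql.{Dataset, SparkSession}

case class SomeOtherCaseClass(name: String, value: Int) // hypothetical fields
case class MyData(id: String, listA: List[SomeOtherCaseClass])

val spark = SparkSession.builder().appName("write-example").getOrCreate()
import spark.implicits._ // provides the implicit Encoder[MyData]

val myData: Dataset[MyData] = Seq(
  MyData("a", List(SomeOtherCaseClass("x", 1))),
  MyData("b", Nil)
).toDS()

// The nested List[SomeOtherCaseClass] is stored as an array<struct> column,
// so the structure survives the round trip.
myData.write.orc("s3://someBucket/somePath")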
And, when reading, just use the same format to get a DataFrame back, which you can cast to a Dataset[MyData] using the as[T] method.
import spark.implicits._ // brings the implicit Encoder[MyData] into scope

val myloadedData: Dataset[MyData] = spark.read.orc("s3://someBucket/somePath").as[MyData]
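A quick round-trip check, reusing myData from the write sketch above (the assertion is only illustrative):

// Case classes compare structurally, so the nested lists are checked by value.
val roundTripped = spark.read.orc("s3://someBucket/somePath").as[MyData].collect().toSet
assert(roundTripped == myData.collect().toSet)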
Or was your question about how to connect to S3? If so, and you are running on EMR, everything is already set up; you only need to prefix your path with s3://, as you already did.
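If you are running outside EMR, a common setup is the s3a:// connector from hadoop-aws. A sketch, assuming that artifact (and its AWS SDK dependency) is on the classpath and that the credentials are provided via the usual AWS environment variables:

// fs.s3a.* are standard Hadoop S3A configuration keys.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))     // assumes the env var is set
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY")) // assumes the env var is set

// With the S3A connector the scheme changes from s3:// to s3a://
myData.write.orc("s3a://someBucket/somePath")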
Upvotes: 1