nrvaller

Reputation: 373

Creating spark dataframe with rows holding object with schema

My goal is to have a Spark DataFrame that holds each of my Candy objects in a separate row, with their respective properties:

+------------------------------------+
|main                                |
+------------------------------------+
|{"brand":"brand1","name":"snickers"}|
|{"brand":"brand2","name":"haribo"}  |
+------------------------------------+

Case class for the proof of concept:

case class Candy(brand: String, name: String)
val candy1 = Candy("brand1", "snickers")
val candy2 = Candy("brand2", "haribo")

So far I have only managed to put them in the same row with:

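import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.write
import spark.implicits._ // needed for toDF outside spark-shell

implicit val formats: DefaultFormats.type = DefaultFormats

// Serializes the whole array into a single JSON string
val body = write(Array(candy1, candy2))
// Seq(body) has exactly one element, so the DataFrame has exactly one row
val df = Seq(body).toDF("main")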

df.show(5, false)

This puts everything in one row instead of two. How can I split each object into its own row while keeping the schema of my Candy object?

+-------------------------------------------------------------------------+
|                     main                                                |
+-------------------------------------------------------------------------+
|[{"brand":"brand1","name":"snickers"},{"brand":"brand2","name":"haribo"}]|
+-------------------------------------------------------------------------+

Upvotes: 1

Views: 216

Answers (1)

SCouto

Reputation: 7928

Do you want to keep each item as a JSON string inside the DataFrame?

If you don't, you can do this, taking advantage of the Dataset's ability to handle case classes:

val df = Seq(candy1, candy2).toDS

This will give you the following output:

+------+--------+
| brand|    name|
+------+--------+
|brand1|snickers|
|brand2|  haribo|
+------+--------+
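
For anyone running this outside spark-shell, here is a minimal self-contained sketch of the same idea. The object name `CandyDataset`, the app name, and the `local[*]` master are assumptions for illustration only. The case class is defined at top level so Spark can derive an encoder for it, and `spark.implicits._` is what provides `toDS`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Top-level case class so that Spark can derive an Encoder for it
case class Candy(brand: String, name: String)

object CandyDataset {
  // Builds a typed Dataset with one row per Candy
  def build(spark: SparkSession): Dataset[Candy] = {
    import spark.implicits._ // brings toDS and the case class encoder into scope
    Seq(Candy("brand1", "snickers"), Candy("brand2", "haribo")).toDS()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; app name and master are assumptions
    val spark = SparkSession.builder()
      .appName("candy-poc")
      .master("local[*]")
      .getOrCreate()

    val ds = build(spark)
    ds.printSchema() // columns brand and name, both strings
    ds.show(false)

    spark.stop()
  }
}
```

Each case class field becomes a column, so the schema of Candy is kept without any manual JSON handling.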

IMHO that's the best option, but if you want to keep your data as a JSON string, then you can first define a toJson method inside your case class:

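case class Candy(brand: String, name: String) {
  // Hand-built JSON: values are not escaped, so a quote inside brand or name would produce invalid JSON
  def toJson: String = s"""{"brand": "$brand", "name": "$name" }"""
}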

And then build the DF using that method:

val df = Seq(candy1.toJson, candy2.toJson).toDF("main")

Output:

+----------------------------------------+
|main                                    |
+----------------------------------------+
|{"brand": "brand1", "name": "snickers" }|
|{"brand": "brand2", "name": "haribo" }  |
+----------------------------------------+
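
As an alternative to a hand-written toJson, Spark's own `Dataset.toJSON` can produce one JSON string per row. The sketch below is not part of the original answer; the object name `CandyJsonRows` and the local session settings are assumptions for illustration. `toJSON` returns a `Dataset[String]`, and `toDF("main")` renames its single column to match the layout asked for in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Top-level case class so that Spark can derive an Encoder for it
case class Candy(brand: String, name: String)

object CandyJsonRows {
  // Builds a single-column DataFrame with one JSON string per Candy
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._ // brings toDS and the case class encoder into scope
    Seq(Candy("brand1", "snickers"), Candy("brand2", "haribo"))
      .toDS()        // one typed row per Candy
      .toJSON        // Dataset[String]: each row serialized as a JSON object
      .toDF("main")  // rename the single column to "main"
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; app name and master are assumptions
    val spark = SparkSession.builder()
      .appName("candy-json-poc")
      .master("local[*]")
      .getOrCreate()

    build(spark).show(false)

    spark.stop()
  }
}
```

Because the serialization is done by Spark, string values are escaped properly. The json4s `write` from the question, applied to each object rather than to the whole array, should work along the same lines.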

Upvotes: 2
