AIBball

Reputation: 281

How do I create a Scala trait that stores data from other columns in a dataset, and then create a new dataset with a column storing that trait?

I am new to Scala and am currently studying Datasets in Scala and Spark. Based on my input dataset below, I am trying to create a new dataset (also shown below). In the new dataset, I aim to have a new column that contains a Scala trait, as Seq[order_summary]. The trait stores the corresponding Name, Ticket Number, and Seat Number taken from the input dataset.

I have used input_dataset.groupBy("Name") to organise the dataset and have tried df.withColumn("NewColumn", struct(df("a"), df("b"))) to combine different columns together. However, I would like to use a Scala trait instead, and I am also stuck on matching each name to its ticket numbers. Would anyone know how to resolve this or point me in the right direction?
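For reference, this is roughly the trait I have in mind (the names here are just my own placeholders):

```scala
// Placeholder trait describing one order summary entry
trait OrderSummary {
  def name: String
  def ticketNumber: Int
  def seatNumber: String
}

// A case class implementing the trait, one instance per input row
case class TicketOrder(name: String, ticketNumber: Int, seatNumber: String) extends OrderSummary
```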

Input dataset: input_dataset

Name is of type String, Ticket Number is of type Int, and Seat Number is of type String.

+----+---------------+-------------+
|Name| Ticket Number | Seat Number |
+----+---------------+-------------+
|Adam|      123      |     AB      |
|Adam|      456      |     AC      |
|Adam|      789      |     AD      |
|Bob |     1234      |     BA      |
|Bob |     5678      |     BB      |
|Sam |      987      |     CA      |
|Sam |      654      |     CB      |
|Sam |      321      |     CC      |
|Sam |      876      |     CD      |
+----+---------------+-------------+

Output dataset

Name Type is String. Purchase Order Summary is a trait, Seq[order_summary]

+----+-----------------------------------------------------+
|Name| Purchase Order Summary                              |
+----+-----------------------------------------------------+
|Adam|((Adam,123,AB),(Adam,456,AC),(Adam,789,AD))          | 
|Bob |((Bob,1234,BA),(Bob,5678,BB))                        |
|Sam |((Sam,987,CA),(Sam,654,CB),(Sam,321,CC),(Sam,876,CD))|
+----+-----------------------------------------------------+

Upvotes: 0

Views: 152

Answers (1)

Dasph

Reputation: 446

Pretty sure Spark has a map method.

So you could just create a case class

case class PurchaseOrderSummary(name: String, ticketNum: Int, seatNum: String)

and instantiate it inside a map from your DF, then collect it into a list.

df.map(row => PurchaseOrderSummary(row.getString(0), row.getInt(1), row.getString(2))).collectAsList

collectAsList retrieves the data to the driver as a java.util.List[PurchaseOrderSummary]; use collect instead if you want a Scala Array. Note that mapping a Dataset to a case class needs an encoder, so make sure spark.implicits._ is in scope.
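The map/collect above gives you one element per row, but your expected output has one row per name. One option (a sketch, assuming the data fits on the driver) is to group the collected results with plain Scala collections:

```scala
case class PurchaseOrderSummary(name: String, ticketNum: Int, seatNum: String)

// Sample data standing in for the rows collected from the Dataset
val collected = List(
  PurchaseOrderSummary("Adam", 123, "AB"),
  PurchaseOrderSummary("Adam", 456, "AC"),
  PurchaseOrderSummary("Bob", 1234, "BA")
)

// One entry per name, each holding that person's order summaries in input order
val byName: Map[String, List[PurchaseOrderSummary]] = collected.groupBy(_.name)

println(byName("Adam").map(_.ticketNum)) // List(123, 456)
```

If the data is too large to collect, it is better to do the grouping on the Spark side with groupBy("Name") and collect_list over a struct of the three columns, and only collect the already-grouped result.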

Upvotes: 0
