nstudenski
nstudenski

Reputation: 316

How can I split an array of structs into columns in Spark?

I have an column containing an array of phone numbers represented as structs, and need to put them into three columns by a "type" attribute (phone1, phone2, fax).

Here are two sample values of the column.

[{"number":"000-000-0000","type":"Phone1"},{"number":"000-000-0001","type":"Phone2"},{"number":"000-000-0002","type":"Fax"}]
[{"number":"000-000-1000","type":"Phone1"},{"number":"000-000-1001","typeCode":"Fax"},{"number":"000-000-1002","type":"Phone2"}]

I want to split each into three columns, one for each type. I want something like this:

   Phone1           Phone2           Fax
000-000-0000     000-000-0001     000-000-0002
000-000-1000     000-000-1002     000-000-1001

This answer shows how to put each element of the array into its own column. How to explode an array into multiple columns in Spark

This gets me halfway there, but I can't rely on the order of items in the array. If I do this I'll get something like this, where the Phone2 and Fax values in the second column are out of place.

   Phone1           Phone2           Fax
000-000-0000     000-000-0001     000-000-0002
000-000-1000     000-000-1001     000-000-1002

How can I split the single column value into three columns, using the type value? An array can have 0-3 numbers, but will never have more than one number of each type.

Upvotes: 1

Views: 785

Answers (1)

Leo C
Leo C

Reputation: 22449

Here's one way which involves flattening the phone/fax#s via explode, followed by pivoting on the typeCode, as shown in the following example:

case class Contact(number: String, typeCode: String)

val df = Seq(
  (1, Seq(Contact("111-22-3333", "Phone1"), Contact("111-44-5555", "Phone2"), Contact("111-66-7070", "Fax"))),
  (2, Seq(Contact("222-33-4444", "Phone1"), Contact("222-55-6060", "Fax"), Contact("111-77-8888", "Phone2")))
).toDF("user_id", "contacts")

df.
  withColumn("contact", explode($"contacts")).
  groupBy($"user_id").pivot($"contact.typeCode").agg(first($"contact.number")).
  show(false)
// +-------+-----------+-----------+-----------+
// |user_id|Fax        |Phone1     |Phone2     |
// +-------+-----------+-----------+-----------+
// |1      |111-66-7070|111-22-3333|111-44-5555|
// |2      |222-55-6060|222-33-4444|111-77-8888|
// +-------+-----------+-----------+-----------+

Upvotes: 1

Related Questions