user3243499

Reputation: 3161

How to explode two array fields to multiple columns in Spark?

I was referring to How to explode an array into multiple columns in Spark for a similar need.

That code works for a dataframe with a single array field; however, when the dataframe has multiple array fields, I'm not able to convert all of them to multiple columns.

For example,

dataframe1

+--------------------+----------------------------------+----------------------------------+
|                 f1 |f2                                |f3                                |
+--------------------+----------------------------------+----------------------------------+
|12                  |                              null|                              null|
|13                  |                              null|                              null|
|14                  |                              null|                              null|
|15                  |                              null|                              null|
|16                  |                              null|                              null|
|17                  |                [[Hi, 256, Hello]]|        [[a, b], [a, b, c],[a, b]]|
|18                  |                              null|                              null|
|19                  |                              null|                              null|
+--------------------+----------------------------------+----------------------------------+

I want to convert it to below dataframe:

dataframe2

+--------------------+----------------------------------+----------------------------------+----------------------------------+
|                 f1 |f2_0                              |f3_0                              |f3_1                              |
+--------------------+----------------------------------+----------------------------------+----------------------------------+
|12                  |                              null|                              null|                              null|
|13                  |                              null|                              null|                              null|
|14                  |                              null|                              null|                              null|
|15                  |                              null|                              null|                              null|
|16                  |                              null|                              null|                              null|
|17                  |                  [Hi, 256, Hello]|                            [a, b]|                         [a, b, c]|
|18                  |                              null|                              null|                              null|
|19                  |                              null|                              null|                              null|
+--------------------+----------------------------------+----------------------------------+----------------------------------+

I tried with the following code:

val dataframe2 = dataframe1.select(
  col("f1") +: (0 until 2).map(i => col("f2")(i).alias(s"f2_$i")): _* +: (0 until 2).map(i => col("f3")(i).alias(s"f3_$i")): _*
)

But it throws a compilation error saying a ")" is expected after the first "_*".

Upvotes: 2

Views: 1092

Answers (2)

Praveen L

Reputation: 987

Shaido's answer is already correct, and this answer is only an enhancement to it: here I add a way to find the maximum length of the array columns dynamically.

If the columns f2 and f3 are already arrays, their corresponding maximum array sizes can be computed as follows.

val s1 = df.select(max(size(df("f2")))).first().getInt(0)
val s2 = df.select(max(size(df("f3")))).first().getInt(0)

If instead the columns are strings that should first be split on a delimiter and then divided into columns, compute the sizes as follows.

val s1 = df.select(max(size(split(df("f2"), ",")))).first().getInt(0)
val s2 = df.select(max(size(split(df("f3"), ",")))).first().getInt(0)

Then s1 and s2 can be used in the map calls from Shaido's answer, e.g. (0 until s1).map( .....
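The dynamic-sizing idea can be illustrated with plain Scala collections (a standalone sketch independent of Spark, with hypothetical sample data): compute the maximum inner-array length, then index each row up to that bound, yielding null where an element is missing, just as col("f2")(i) does for out-of-range indices.

```scala
// Hypothetical sample data standing in for one array column of a dataframe.
val rows: Seq[Seq[String]] = Seq(
  Seq("a", "b"),
  Seq("a", "b", "c"),
  Seq("a")
)

// Analogue of df.select(max(size(df("f2")))).first().getInt(0):
// the largest inner-array length across all rows.
val maxSize = rows.map(_.length).max

// Analogue of (0 until maxSize).map(i => col("f2")(i)): index each row
// up to maxSize, producing null where the element does not exist.
val exploded = rows.map(r => (0 until maxSize).map(i => r.lift(i).orNull))
```

This mirrors why the size must be computed first: the number of output columns is fixed by the widest row.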

Upvotes: 0

Shaido

Reputation: 28392

+: is used in Scala to prepend a single element to a sequence. It can't be used to concatenate two sequences together. Instead, you can use ++ as follows:

val cols = Seq(col("f1")) ++
  (0 until 1).map(i => col("f2")(i).alias(s"f2_$i")) ++
  (0 until 2).map(i => col("f3")(i).alias(s"f3_$i"))

val dataframe2 = dataframe1.select(cols: _*)

(Note that the ++ must end each line rather than start the next one; otherwise Scala's semicolon inference treats each line as a separate statement and the code won't compile.)

Note that to use this approach, you need to know the number of elements of the lists in advance. Above, I changed 2 to 1 for the f2 column.
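The difference between the two operators is easy to see with plain Scala sequences (a standalone sketch, independent of Spark):

```scala
val a = Seq(2, 3)
val b = Seq(4, 5)

// +: prepends one element to a sequence.
val prepended = 1 +: a        // Seq(1, 2, 3)

// ++ concatenates two sequences.
val concatenated = a ++ b     // Seq(2, 3, 4, 5)

// Using +: between two sequences does NOT concatenate them: the left-hand
// sequence becomes a single nested element instead.
val nested = a +: b           // Seq(Seq(2, 3), 4, 5)
```

This is why the question's attempt fails: each `+:` between the column lists treats the whole list as one element, and the misplaced `_*` (which may only appear as the last argument of a call) compounds the error.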

Upvotes: 1
