Tobias Hermann
Tobias Hermann

Reputation: 10966

How to derive new columns from the values of a given column?

Given a dataframe like the following:

foo    bar    baz
  a      4      x
  b      3      x
  c      5      y
  b      5      y
  a      7      z

I'd like to obtain this:

bar_a    bar_b    bar_c    baz
    4     null     null      x
 null        3     null      x
 null     null        5      y
 null        5     null      y
    7     null     null      z

Is there some function build-in to Spark to achieve this?

Upvotes: 0

Views: 53

Answers (1)

mck
mck

Reputation: 42422

You can do a pivot:

val df2 = df.groupBy($"baz", monotonically_increasing_id().as("id"))
            .pivot("foo")
            .agg(first($"bar"))
            .drop("id")

df2.show
+---+----+----+----+
|baz|   a|   b|   c|
+---+----+----+----+
|  y|null|null|   5|
|  y|null|   5|null|
|  x|   4|null|null|
|  x|null|   3|null|
|  z|   7|null|null|
+---+----+----+----+

If you want pretty column names:

val df3 = df2.toDF(df2.columns.map(c => if (df2.columns.drop(1).contains(c)) "bar_" + c else c):_*)

df3.show
+---+-----+-----+-----+
|baz|bar_a|bar_b|bar_c|
+---+-----+-----+-----+
|  y| null| null|    5|
|  y| null|    5| null|
|  x|    4| null| null|
|  x| null|    3| null|
|  z|    7| null| null|
+---+-----+-----+-----+

Upvotes: 1

Related Questions