scoder

Reputation: 2611

Spark: retain all the columns of the original data frame after pivot

I have a data frame with many columns, 50-plus (as shown below):

+----+----+----+----+----+----+----+----+----+----+----+...
|  c1|  c2|  c3|  c4|  c5|  c6|  c7|  c8|type| clm| val|...
+----+----+----+----+----+----+----+----+----+----+----+...
|  11| 5.0| 3.0| 3.0| 3.0| 4.0| 3.0| 3.0|  t1|   a|   5|...
|  31| 5.0| 3.0| 3.0| 3.0| 4.0| 3.0| 3.0|  t2|   b|   6|...
|  11| 5.0| 3.0| 3.0| 3.0| 4.0| 3.0| 3.0|  t1|   a|   9|...
+----+----+----+----+----+----+----+----+----+----+----+...

I want to convert the values of one column into many columns, so I am thinking of using the code below:

df.groupBy("type").pivot("clm").agg(first("val")).show() 

This converts the row values into columns, but the other columns (c1 to c8) do not come through into the resultant data frame.

So is it okay to use the method below to get all the columns after the pivot?

df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","type").pivot("clm").agg(first("val")).show()

Upvotes: 0

Views: 438

Answers (1)

YoYo

Reputation: 9415

The pivot goes between groupBy and agg; the other columns can be carried through by aggregating them like any other column, for example with first:

import org.apache.spark.sql.functions.first

df
  .groupBy("type")
  .pivot("clm")
  .agg(
    first("val"),
    first("c1"),
    first("c2"),
    first("c3"),
    first("c4"),
    first("c5"),
    first("c6"),
    first("c7"),
    first("c8")
  )
  .show()

Writing it like that assumes that the values of c1..c8 are duplicated (i.e. identical) within the same type. If not, then the .groupBy(...) needs to be tuned to exactly how your data is organized.
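
Note that with more than one aggregation after a pivot, Spark prefixes each output column with the pivot value (names along the lines of a_first(val), a_first(c1); the exact formatting depends on the Spark version), so c1..c8 come back once per pivot value. If you would rather keep them under their original names, a rough alternative, under the same assumption that c1..c8 are constant within a type, is to pivot only val and join the result back on type:

import org.apache.spark.sql.functions.first

// pivot only "val", giving one row per type
val pivoted = df.groupBy("type").pivot("clm").agg(first("val"))

// carry c1..c8 through separately (assumes they are constant within each type)
val carried = df.groupBy("type").agg(
  first("c1").as("c1"), first("c2").as("c2"), first("c3").as("c3"), first("c4").as("c4"),
  first("c5").as("c5"), first("c6").as("c6"), first("c7").as("c7"), first("c8").as("c8")
)

// join the carried columns back onto the pivoted result
carried.join(pivoted, Seq("type")).show()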

Upvotes: 1
