scoder

Reputation: 2611

Spark: retain all the columns of the original data frame after pivot

I have a data frame with many columns, 50-plus (as shown below):

+----+----+----+----+----+----+----+----+----+----+----+...
|  c1|  c2|  c3|  c4|  c5|  c6|  c7|  c8|type| clm| val|...
+----+----+----+----+----+----+----+----+----+----+----+...
|  11| 5.0| 3.0| 3.0| 3.0| 4.0| 3.0| 3.0|  t1|   a|   5|...
|  31| 5.0| 3.0| 3.0| 3.0| 4.0| 3.0| 3.0|  t2|   b|   6|...
|  11| 5.0| 3.0| 3.0| 3.0| 4.0| 3.0| 3.0|  t1|   a|   9|...
+----+----+----+----+----+----+----+----+----+----+----+...

I want to convert the values of one column into many columns, so I am thinking of using the code below:

df.groupBy("type").pivot("clm").agg(first("val")).show() 

This converts the row values into columns, but the other columns (c1 to c8) do not come through into the resultant data frame.

So is it okay to use the method below to get all the columns after the pivot?

df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","type").pivot("clm").agg(first("val")).show()

Upvotes: 0

Views: 438

Answers (1)

YoYo

Reputation: 9415

The pivot goes between groupBy and agg; the other columns can be carried through by aggregating them like any other column, for example with first:

import org.apache.spark.sql.functions.first

df
  .groupBy("type")
  .pivot("clm")
  .agg(
    first("val"),
    first("c1"),
    first("c2"),
    first("c3"),
    first("c4"),
    first("c5"),
    first("c6"),
    first("c7"),
    first("c8")
  )
  .show()

Writing it like that assumes that the values of c1..c8 are duplicated (i.e. identical) within the same type. If not, then the .groupBy(...) needs to be tuned to exactly how your data is organized.
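
Note that with more than one aggregation after a pivot, Spark prefixes each output column with the pivot value (names along the lines of a_first(val), a_first(c1); the exact formatting depends on the Spark version), so c1..c8 come back once per pivot value. If you would rather keep them under their original names, a rough alternative, under the same assumption that c1..c8 are constant within a type, is to pivot only val and join the result back on type:

import org.apache.spark.sql.functions.first

// pivot only "val", giving one row per type
val pivoted = df.groupBy("type").pivot("clm").agg(first("val"))

// carry c1..c8 through separately (assumes they are constant within each type)
val carried = df.groupBy("type").agg(
  first("c1").as("c1"), first("c2").as("c2"), first("c3").as("c3"), first("c4").as("c4"),
  first("c5").as("c5"), first("c6").as("c6"), first("c7").as("c7"), first("c8").as("c8")
)

// join the carried columns back onto the pivoted result
carried.join(pivoted, Seq("type")).show()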

Upvotes: 1
