Iurii Bakach
Iurii Bakach

Reputation: 23

pyspark pivot without aggregation function

I am new to pyspark and I am trying to get from here:

+---+-----+--------+
| id|value|ordering|
+---+-----+--------+ 
| A | 21  | 1      |
| A | 22  | 2      |
| A | 23  | 3      |
| A | 24  | 4      |
| B | 15  | 1      |
| B | 16  | 2      |
| B | 10  | 3      |
| B | 10  | 4      |
+---+-----+--------+

to here:

+---+-------+-------+-------+----------+
| id|value_0|value_1|value_n| value_n+1|
+---+-------+-------+-------+----------+
| A | 21    | 22    | 23    |  24      |
| B | 15    | 16    | 10    |  10      |
+---+-------+-------+-------+----------+

I tried using pivot, but I don't need to aggregate, and I don't understand how to do it without aggregation. I also tried approach outlined here Rolling up multiple rows into a single row and column in spark, and then split col into multiple cols, but I need to keep the order of elements as they appear in the dataframe (I have extra column -'ordering'- I can use to orderBy). I can't do it with this approach as concat_ws doesn't preserve order + I can have repetitions.

Any help appreciated

Upvotes: 2

Views: 1382

Answers (1)

samkart
samkart

Reputation: 6654

You can use the first function as the aggregation.

Can we use the ordering field directly to create "value_1", "value_2", so on? If yes, see following.

data_sdf. \
    withColumn('pivot_field', func.concat_ws('_', func.lit('value'), 'ordering')). \
    groupBy('id'). \
    pivot('pivot_field'). \
    agg(func.first('value')). \
    show()

# +---+-------+-------+-------+-------+
# | id|value_1|value_2|value_3|value_4|
# +---+-------+-------+-------+-------+
# |  B|     15|     16|     10|     10|
# |  A|     21|     22|     23|     24|
# +---+-------+-------+-------+-------+

If not, create a column with the row_number and use that. See following.

data_sdf. \
    withColumn('rn', func.row_number().over(wd.partitionBy('id').orderBy('ordering'))). \
    withColumn('pivot_field', func.concat_ws('_', func.lit('value'), 'rn')). \
    groupBy('id'). \
    pivot('pivot_field'). \
    agg(func.first('value')). \
    show()

# +---+-------+-------+-------+-------+
# | id|value_1|value_2|value_3|value_4|
# +---+-------+-------+-------+-------+
# |  B|     15|     16|     10|     10|
# |  A|     21|     22|     23|     24|
# +---+-------+-------+-------+-------+

This assumes that the data is unique across the columns being used here - as in all values within an ID will have distinct values in ordering. If not, you could change the aggregation to your choice.

Upvotes: 3

Related Questions