Reputation: 23
I am new to pyspark and I am trying to get from here:
+---+-----+--------+
| id|value|ordering|
+---+-----+--------+
| A | 21 | 1 |
| A | 22 | 2 |
| A | 23 | 3 |
| A | 24 | 4 |
| B | 15 | 1 |
| B | 16 | 2 |
| B | 10 | 3 |
| B | 10 | 4 |
+---+-----+--------+
to here:
+---+-------+-------+-------+----------+
| id|value_0|value_1|value_n| value_n+1|
+---+-------+-------+-------+----------+
| A | 21 | 22 | 23 | 24 |
| B | 15 | 16 | 10 | 10 |
+---+-------+-------+-------+----------+
I tried using pivot, but I don't need to aggregate, and I don't understand how to do it without aggregation. I also tried the approach outlined here: Rolling up multiple rows into a single row and column in spark, then splitting the column into multiple columns, but I need to keep the order of the elements as they appear in the dataframe (I have an extra column, ordering, that I can use with orderBy). I can't do it with that approach because concat_ws doesn't preserve order, and I can have repetitions.
Any help appreciated
Upvotes: 2
Views: 1382
Reputation: 6654
You can use the first function as the aggregation.
Can we use the ordering field directly to create "value_1", "value_2", and so on? If yes, see the following.
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# build a pivot label like 'value_1' from the ordering column, then take the first value per cell
data_sdf. \
    withColumn('pivot_field', func.concat_ws('_', func.lit('value'), 'ordering')). \
    groupBy('id'). \
    pivot('pivot_field'). \
    agg(func.first('value')). \
    show()
# +---+-------+-------+-------+-------+
# | id|value_1|value_2|value_3|value_4|
# +---+-------+-------+-------+-------+
# | B| 15| 16| 10| 10|
# | A| 21| 22| 23| 24|
# +---+-------+-------+-------+-------+
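For reference, a minimal sketch of how the data_sdf used above could be created, assuming an existing SparkSession named spark (the variable names are assumptions; the rows come from the question's input table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the question's input table
data_sdf = spark.createDataFrame(
    [('A', 21, 1), ('A', 22, 2), ('A', 23, 3), ('A', 24, 4),
     ('B', 15, 1), ('B', 16, 2), ('B', 10, 3), ('B', 10, 4)],
    ['id', 'value', 'ordering']
)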
If not, create a column with row_number and use that. See the following.
# assign a sequential row number per id, ordered by the ordering column, and pivot on that
data_sdf. \
    withColumn('rn', func.row_number().over(wd.partitionBy('id').orderBy('ordering'))). \
    withColumn('pivot_field', func.concat_ws('_', func.lit('value'), 'rn')). \
    groupBy('id'). \
    pivot('pivot_field'). \
    agg(func.first('value')). \
    show()
# +---+-------+-------+-------+-------+
# | id|value_1|value_2|value_3|value_4|
# +---+-------+-------+-------+-------+
# | B| 15| 16| 10| 10|
# | A| 21| 22| 23| 24|
# +---+-------+-------+-------+-------+
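(The row_number over a window partitioned by id and ordered by ordering always yields a consecutive 1..n sequence per id, so the generated column names stay value_1 through value_n even if the ordering values themselves have gaps.)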
This assumes that the data is unique across the columns being used here, i.e. every row within an id has a distinct value in ordering. If not, you could change the aggregation to one of your choice.
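For example, one possible variation (a sketch, not part of the original answer) is to collect every value for a given pivot label into an array instead of keeping only the first:

# sketch: keep all values per id/ordering pair in an array instead of just the first
data_sdf. \
    withColumn('pivot_field', func.concat_ws('_', func.lit('value'), 'ordering')). \
    groupBy('id'). \
    pivot('pivot_field'). \
    agg(func.collect_list('value')). \
    show()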
Upvotes: 3