LeoPinheiro
LeoPinheiro

Reputation: 53

How to dynamically transpose a single Column to multiple Rows in pyspark?

I have a dataframe that looks like below

ColName
a
b
c
d
e
f
g
h
i
j
k
l

and based on an specific parameter I want to transpose those values into rows. So for example if the parameter value is 3, the new dataframe will look like below

Col1 Col2 Col3
a b c
d e f
g h i
j k l

However if the parameter value is 4, it will look like below

Col1 Col2 Col3 Col4
a b c d
e f g h
i j k l

A few things to notice:

  1. The column names are not important
  2. Both the number of items in that single column and the parameter can change

Any idea how to achieve this in pyspark? Thanks in advance.

Upvotes: 1

Views: 533

Answers (2)

noufel13
noufel13

Reputation: 663

You can use a combination of collect_list and row_number to achieve this.

Step 1: Generate a custom row_number.

from pyspark.sql.functions import floor, row_number, collect_list
from pyspark.sql.window import Window

no_of_columns = 3

df2 = df.withColumn("row_num", floor((row_number().over(Window.orderBy("ColName"))-1)/no_of_columns))

Step 2: Group the data using this row_number and use collect_list to create a list.

df3 = df2.groupBy("row_num").agg(collect_list("ColName").alias("col_list"))

Step 3: Use python's list comprehension to select all the elements from this list.

df3.select(*[df3.col_list[i].alias(f"col{i+1}") for i in range(no_of_columns)]).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   b|   c|
|   d|   e|   f|
|   g|   h|   i|
|   j|   k|   l|
+----+----+----+

Note: The parameter no_of_columns can be changed based on the required number of output columns.

Upvotes: 0

mck
mck

Reputation: 42352

You can add some helper columns to pivot the dataframe:

import pyspark.sql.functions as F

x = 3

result = df.withColumn(
    'id',
    F.monotonically_increasing_id()
).withColumn(
    'id2',
    (F.col('id') / x).cast('int')
).withColumn(
    'id3',
    F.col('id') % x
).groupBy('id2').pivot('id3').agg(F.first('ColName')).orderBy('id2').drop('id2')

result.show()
+---+---+---+
|  0|  1|  2|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  g|  h|  i|
|  j|  k|  l|
+---+---+---+

Upvotes: 1

Related Questions