Reputation: 53
I have a dataframe that looks like the one below:
ColName |
---|
a |
b |
c |
d |
e |
f |
g |
h |
i |
j |
k |
l |
and based on a specific parameter I want to transpose those values into rows. For example, if the parameter value is 3, the new dataframe will look like the one below:
Col1 | Col2 | Col3 |
---|---|---|
a | b | c |
d | e | f |
g | h | i |
j | k | l |
However, if the parameter value is 4, it will look like the one below:
Col1 | Col2 | Col3 | Col4 |
---|---|---|---|
a | b | c | d |
e | f | g | h |
i | j | k | l |
Any idea how to achieve this in PySpark? Thanks in advance.
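For reference, here is a minimal way to build the sample dataframe (a sketch, assuming an active SparkSession named spark):

data = [(v,) for v in "abcdefghijkl"]  # the twelve sample values
df = spark.createDataFrame(data, ["ColName"])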
Upvotes: 1
Views: 533
Reputation: 663
You can use a combination of collect_list and row_number to achieve this.
Step 1: Generate a custom row number.
from pyspark.sql.functions import floor, row_number, collect_list
from pyspark.sql.window import Window

no_of_columns = 3
# 0-based group index: every no_of_columns consecutive rows (ordered by ColName) share a row_num
df2 = df.withColumn("row_num", floor((row_number().over(Window.orderBy("ColName")) - 1) / no_of_columns))
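With the sample data and no_of_columns = 3, this step would produce something like the following (a sketch; the exact display depends on your data):

df2.show(6)
# +-------+-------+
# |ColName|row_num|
# +-------+-------+
# |      a|      0|
# |      b|      0|
# |      c|      0|
# |      d|      1|
# |      e|      1|
# |      f|      1|
# +-------+-------+
# only showing top 6 rows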
Step 2: Group the data by this row_num and use collect_list to gather each group's values into a list.
# One row per group, with the group's values collected into an array
df3 = df2.groupBy("row_num").agg(collect_list("ColName").alias("col_list"))
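Inspecting df3 at this point would give something like this (a sketch; the row order coming out of groupBy is not guaranteed):

df3.show(truncate=False)
# +-------+---------+
# |row_num|col_list |
# +-------+---------+
# |0      |[a, b, c]|
# |1      |[d, e, f]|
# |2      |[g, h, i]|
# |3      |[j, k, l]|
# +-------+---------+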
Step 3: Use a Python list comprehension to turn each list element into its own column (ordering by row_num keeps the rows in their original order).
df3.orderBy("row_num").select(*[df3.col_list[i].alias(f"col{i+1}") for i in range(no_of_columns)]).show()
Output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| a| b| c|
| d| e| f|
| g| h| i|
| j| k| l|
+----+----+----+
Note: The parameter no_of_columns can be changed based on the required number of output columns.
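If you need to run this for different parameter values, the three steps can be wrapped in a small helper (a sketch; transpose_to_rows and order_col are hypothetical names, not built-ins):

from pyspark.sql.functions import floor, row_number, collect_list
from pyspark.sql.window import Window

def transpose_to_rows(df, no_of_columns, order_col="ColName"):
    # Hypothetical helper combining the three steps above
    w = Window.orderBy(order_col)
    grouped = (
        df.withColumn("row_num", floor((row_number().over(w) - 1) / no_of_columns))
          .groupBy("row_num")
          .agg(collect_list(order_col).alias("col_list"))
          .orderBy("row_num")
    )
    return grouped.select(*[grouped.col_list[i].alias(f"col{i+1}") for i in range(no_of_columns)])

transpose_to_rows(df, 4).show()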
Upvotes: 0
Reputation: 42352
You can add some helper columns to pivot the dataframe:
import pyspark.sql.functions as F

x = 3

result = df.withColumn(
    'id',
    # Unique, increasing ids; consecutive only within a single partition
    F.monotonically_increasing_id()
).withColumn(
    'id2',
    (F.col('id') / x).cast('int')  # which output row this value belongs to
).withColumn(
    'id3',
    F.col('id') % x  # position within the row, i.e. which output column
).groupBy('id2').pivot('id3').agg(F.first('ColName')).orderBy('id2').drop('id2')
result.show()
+---+---+---+
| 0| 1| 2|
+---+---+---+
| a| b| c|
| d| e| f|
| g| h| i|
| j| k| l|
+---+---+---+
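Note that monotonically_increasing_id is only guaranteed to be consecutive within a single partition, so the division/modulo arithmetic can break on a multi-partition dataframe. A variant that avoids this (a sketch, swapping in row_number and renaming the pivoted columns to match the question):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

x = 3

# row_number produces gap-free ids regardless of partitioning
result = df.withColumn(
    'id', F.row_number().over(Window.orderBy('ColName')) - 1
).withColumn(
    'id2', (F.col('id') / x).cast('int')
).withColumn(
    'id3', F.col('id') % x
).groupBy('id2').pivot('id3').agg(F.first('ColName')).orderBy('id2').drop('id2')

# Rename the pivoted columns 0..x-1 to Col1..Colx
result = result.toDF(*[f'Col{i+1}' for i in range(x)])
result.show()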
Upvotes: 1