PySpark DataFrame - Append Random Permutation of a Single Column

Question

I'm using PySpark (a new thing for me). Now, suppose I Have the following table: +-------+-------+----------+ | Col1 | Col2 | Question | +-------+-------+----------+ | val11 | val12 | q1 | | val21 | val22 | q2 | | val31 | val32 | q3 | +-------+-------+----------+ and I would like to append to it a new column, random_qustion which is in fact a permutation of the values in the Question column, so the result might look like this: +-------+-------+----------+-----------------+ | Col1 | Col2 | Question | random_question | +-------+-------+----------+-----------------+ | val11 | val12 | q1 | q2 | | val21 | val22 | q2 | q3 | | val31 | val32 | q3 | q1 | +-------+-------+----------+-----------------+ I'v tried to do that as follow: python df.withColumn( 'random_question' ,df.orderBy(rand(seed=0))['question'] ).createOrReplaceTempView('with_random_questions') The problem is that the above code does append the required column but WITHOUT permuting the values in it.

What am I doing wrong and how can I fix this?

Thank you,

Gilad

Sequinex · Accepted Answer

This should do the trick:

import pyspark.sql.functions as F

questions = df.select(F.col('Question').alias('random_question'))
random = questions.orderBy(F.rand())

Give the dataframes a unique row id:

df = df.withColumn('row_id', F.monotonically_increasing_id())
random = random.withColumn('row_id', F.monotonically_increasing_id())

Join them by row id:

final_df = df.join(random, 'row_id')

PySpark DataFrame - Append Random Permutation of a Single Column

Answers (2)

Related Questions