Sam Comber

Reputation: 1293

How to create random column by group in pyspark dataframe

I'm trying to use rand() with a window function to create a random set of numbers per group using the below:

df.withColumn("random_groups", F.rand().over(Window.partitionBy("groups")))

However, this raises the following error:

AnalysisException: Expression 'rand(4853692135296631772)' not supported within a window function.

Does anyone have any advice on how to get my intended output? It looks like this:

ID | groups | random_groups
1  |    A   |    0.3
2  |    A   |    0.9
3  |    B   |    0.8

Upvotes: 0

Views: 1009

Answers (1)

Mike Souder

Reputation: 541

Apparently, F.rand() isn't supported with .over(some_window), but if you aren't applying a different random function per group, that doesn't matter: rand() is already evaluated independently for every row, so each group naturally gets its own set of random numbers. Just add the random column and apply whatever per-group logic you need later with filters or groupBy.

df = df.withColumn('random_groups', F.rand())
df.groupBy('groups').agg(F.max('random_groups').alias('max_rand')).show()

If you want different random functions per group, you might need something like this:

df = df.withColumn(
    'random_groups',
    F.when(F.col('groups') == 'A', F.rand(seed=69))
     .when(F.col('groups') == 'B', F.randn(seed=42))
     .otherwise(F.lit(-1))  # omit .otherwise() to get nulls for any other groups
)

Upvotes: 1
