Sam Comber

Reputation: 1293

How to create random column by group in pyspark dataframe

I'm trying to use rand() with a window function to create a random set of numbers per group using the below:

df.withColumn("random_groups", F.rand().over(Window.partitionBy("groups")))

However, this raises the following error:

AnalysisException: Expression 'rand(4853692135296631772)' not supported within a window function.

Does anyone have any advice on how to get my intended output? It looks like this:

ID | groups | random_groups
1  |    A   |    0.3
2  |    A   |    0.9
3  |    B   |    0.8

Upvotes: 0

Views: 1009

Answers (1)

Mike Souder

Reputation: 541

Apparently, F.rand() isn't supported with .over(some_window), but if you aren't applying a different random function per group, that doesn't matter: rand() is already evaluated independently for every row, so each group naturally gets its own set of random numbers. Just add the random column and apply whatever per-group logic you need later with filters or groupBy.

df = df.withColumn('random_groups', F.rand())
df.groupBy('groups').agg(F.max('random_groups').alias('max_rand')).show()

If you want different random functions per group, you might need something like this:

df = df.withColumn(
    'random_groups',
    F.when(F.col('groups') == 'A', F.rand(seed=69))
     .when(F.col('groups') == 'B', F.randn(seed=42))
     .otherwise(F.lit(-1))  # omit .otherwise() to get nulls for any other groups
)

Upvotes: 1
