HHH

Reputation: 6475

Add a column to an existing dataframe with random fixed values in Pyspark

I'm new to Pyspark and I'm trying to add a new column to my existing dataframe. The new column should contain only 4 fixed values (e.g. 1,2,3,4) and I'd like to randomly pick one of the values for each row.

How can I do that?

Upvotes: 1

Views: 3767

Answers (1)

Jeff

Reputation: 2228

Pyspark dataframes are immutable, so you can't assign a column in place the way you can with Pandas dataframes; withColumn returns a new dataframe instead. To do what you want, use a udf:

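from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import random

df = <original df>

# udf needs a callable; random.randint(1, 4) includes both endpoints, so it yields 1, 2, 3 or 4
udf_randint = udf(lambda: random.randint(1, 4), IntegerType()).asNondeterministic()
# withColumn takes (name, column) and returns a new dataframe; call the udf to get a column
df_new = df.withColumn("random_num", udf_randint())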

Upvotes: 2
