Reputation: 6475
I'm new to Pyspark and I'm trying to add a new column to my existing dataframe. The new column should contain only 4 fixed values (e.g. 1,2,3,4
) and I'd like to randomly pick one of the values for each row.
How can I do that?
Upvotes: 1
Views: 3767
Reputation: 2228
Pyspark dataframes are immutable, so you have to return a new one (e.g. you can't just assign to it the way you can with Pandas dataframes). To do what you want use a udf
:
from pyspark.sql.functions import udf
import numpy as np
df = <original df>
udf_randint = udf(np.random.randint(1, 4))
df_new = df.withColumn("random_num": udf_randint)
Upvotes: 2