SarwatFatimaM

Reputation: 395

How can I store a numpy array as a new column in PySpark DataFrame?

I have got a numpy array from np.select and I want to store it as a new column in PySpark DataFrame. How can I do that?

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['abc', 'cde', 'edf']})
df_data = spark.createDataFrame(pdf, schema='a long, b string')

There are a few conditions and choices for which I use np.select like

np.select(conditions, choices, default='Other') 

This returns the following ndarray:

[['val1'], ['val2'], ['val3']]

Now I want to save this ndarray as a new column in df_data.

Upvotes: 2

Views: 1561

Answers (1)

ZygD

Reputation: 24488

You can first convert the ndarray to a Python list and then build a Spark array literal from its elements:

import numpy as np
from pyspark.sql import functions as F

ndarray = np.select(conditions, choices, default='Other')
nd_list = ndarray.tolist()

# Wrap each element in its own single-item array, then collect them into
# an outer array literal attached as a new column.
df_data = df_data.withColumn('ndarray', F.array([F.array(F.lit(e[0])) for e in nd_list]))

This way you create an array of arrays, which should be the equivalent of your list of lists. Note that this is a literal column, so every row receives the same nested array.
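If instead you want each row to get its own value from the ndarray (one value per row, aligned by position), a simpler option is to attach the column on the pandas side before creating the Spark DataFrame. A minimal sketch, using hypothetical `conditions` and `choices` standing in for the ones in the question:

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['abc', 'cde', 'edf']})

# Hypothetical conditions/choices; substitute your own.
conditions = [pdf['a'] == 1, pdf['a'] == 2, pdf['a'] == 3]
choices = ['val1', 'val2', 'val3']

# np.select returns a 1-D ndarray aligned with pdf's rows,
# so it can be assigned directly as a new column.
pdf['ndarray'] = np.select(conditions, choices, default='Other')

# df_data = spark.createDataFrame(pdf)  # each row now carries its own value
print(pdf['ndarray'].tolist())  # ['val1', 'val2', 'val3']
```

This avoids building a constant array column entirely, since the alignment is handled by pandas before Spark ever sees the data.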

Upvotes: 2
