SarwatFatimaM

Reputation: 395

How can I store a numpy array as a new column in PySpark DataFrame?

I have got a numpy array from np.select and I want to store it as a new column in PySpark DataFrame. How can I do that?

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['abc', 'cde', 'edf']})
df_data = spark.createDataFrame(pdf, schema='a long, b string')

There are a few conditions and choices for which I use np.select like

np.select(conditions, choices, default='Other') 

This returns the following ndarray:

[['val1'], ['val2'], ['val3']]

Now I want to save this ndarray as a new column in df_data.

Upvotes: 2

Views: 1561

Answers (1)

ZygD

Reputation: 24488

You can first convert the ndarray to a Python list and then build a Spark array literal from its elements:

import numpy as np
from pyspark.sql import functions as F

ndarray = np.select(conditions, choices, default='Other')
nd_list = ndarray.tolist()

# Wrap each element in its own single-item array, then collect them into
# an outer array literal attached as a new column.
df_data = df_data.withColumn('ndarray', F.array([F.array(F.lit(e[0])) for e in nd_list]))

This way you create an array of arrays, which should be the equivalent of your list of lists. Note that this is a literal column, so every row receives the same nested array.
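If instead you want each row to get its own value from the ndarray (one value per row, aligned by position), a simpler option is to attach the column on the pandas side before creating the Spark DataFrame. A minimal sketch, using hypothetical `conditions` and `choices` standing in for the ones in the question:

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['abc', 'cde', 'edf']})

# Hypothetical conditions/choices; substitute your own.
conditions = [pdf['a'] == 1, pdf['a'] == 2, pdf['a'] == 3]
choices = ['val1', 'val2', 'val3']

# np.select returns a 1-D ndarray aligned with pdf's rows,
# so it can be assigned directly as a new column.
pdf['ndarray'] = np.select(conditions, choices, default='Other')

# df_data = spark.createDataFrame(pdf)  # each row now carries its own value
print(pdf['ndarray'].tolist())  # ['val1', 'val2', 'val3']
```

This avoids building a constant array column entirely, since the alignment is handled by pandas before Spark ever sees the data.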

Upvotes: 2
