Reputation: 395
I have a numpy array from np.select
and I want to store it as a new column in a PySpark DataFrame. How can I do that?
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['abc', 'cde', 'edf']})
df_data = spark.createDataFrame(pdf, schema='a long, b string')
There are a few conditions and choices that I pass to np.select, like
np.select(conditions, choices, default='Other')
This returns the following nd-array:
[['val1'], ['val2'], ['val3']]
Now I want to save this nd-array as a new column in df_data.
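For reference, a minimal sketch that reproduces such an output; the conditions and choices here are hypothetical stand-ins, since the question does not show the actual ones:

```python
import numpy as np

# Hypothetical 2-D input; np.select preserves its (3, 1) shape.
a = np.array([[1], [2], [3]])
conditions = [a == 1, a == 2, a == 3]
choices = ['val1', 'val2', 'val3']

result = np.select(conditions, choices, default='Other')
print(result.tolist())  # -> [['val1'], ['val2'], ['val3']]
```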
Upvotes: 2
Views: 1561
Reputation: 24488
You may try converting your ndarray to a list first, then placing every element of the list into its appropriate location in a Spark array.
import pyspark.sql.functions as F

ndarray = np.select(conditions, choices, default='Other')
nd_list = ndarray.tolist()
df_data = df_data.withColumn('ndarray', F.array([F.array(F.lit(e[0])) for e in nd_list]))
This way you create an array of arrays, which should be the equivalent of your list of lists.
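The reshaping step can be checked without Spark; a small sketch using a stand-in for the np.select result:

```python
import numpy as np

# Stand-in for the (3, 1) array described in the question.
ndarray = np.array([['val1'], ['val2'], ['val3']])

nd_list = ndarray.tolist()        # plain Python list of lists
print(nd_list)                    # -> [['val1'], ['val2'], ['val3']]

# Each inner one-element list maps to one nested literal in the Spark array.
inner_values = [e[0] for e in nd_list]
print(inner_values)               # -> ['val1', 'val2', 'val3']
```

Note that a column built purely from F.lit literals holds the same array-of-arrays value in every row.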
Upvotes: 2