Reputation: 403
My DataFrame has a column num_of_items. It is a count field, and I want to convert it from int type to list type.
I tried using array(col) and even creating a function to return a list by taking an int value as input. Neither worked.
from pyspark.sql.types import ArrayType
from array import array
from pyspark.sql.functions import monotonically_increasing_id

def to_array(x):
    return [x]

df = df.withColumn("num_of_items", monotonically_increasing_id())
df
col_1 | num_of_items
A     | 1
B     | 2
Expected output:
col_1 | num_of_items
A     | [23]
B     | [43]
Upvotes: 3
Views: 8604
Reputation: 43494
I tried using array(col)
Using pyspark.sql.functions.array seems to work for me.
from pyspark.sql.functions import array
df.withColumn("num_of_items", array("num_of_items")).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#| A| [1]|
#| B| [2]|
#+-----+------------+
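As an aside, array isn't limited to a single column: it can pack several columns into one array column. A minimal sketch, assuming a SparkSession named spark is in scope and using two hypothetical count columns qty_a and qty_b:
from pyspark.sql.functions import array

# hypothetical DataFrame with two count columns
df2 = spark.createDataFrame([("A", 1, 10), ("B", 2, 20)], ["col_1", "qty_a", "qty_b"])

# array() collects any number of columns into a single array column
df2.withColumn("quantities", array("qty_a", "qty_b")).show()
#+-----+-----+-----+----------+
#|col_1|qty_a|qty_b|quantities|
#+-----+-----+-----+----------+
#|    A|    1|   10|   [1, 10]|
#|    B|    2|   20|   [2, 20]|
#+-----+-----+-----+----------+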
and even creating a function to return a list by taking int value as input.
If you want to use the function you created, you have to make it a udf and specify the return type:
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql.functions import udf, col

# wrap the to_array function defined in the question, declaring the array return type
to_array_udf = udf(to_array, ArrayType(IntegerType()))
df.withColumn("num_of_items", to_array_udf(col("num_of_items"))).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#| A| [1]|
#| B| [2]|
#+-----+------------+
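If you'd rather not keep a separate named function around, the same udf can be written inline with a lambda. A minimal sketch (the name wrap_in_list is just for illustration):
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql.functions import udf, col

# inline equivalent of to_array: wrap each int in a single-element list
wrap_in_list = udf(lambda x: [x], ArrayType(IntegerType()))
df.withColumn("num_of_items", wrap_in_list(col("num_of_items"))).show()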
But it's preferable to avoid using udfs when possible: see Spark functions vs UDF performance?
Upvotes: 7