Emma

Reputation: 403

Convert int column to list type pyspark

My DataFrame has a column num_of_items, which is a count field. Now I want to convert it from int type to list (array) type.

I tried using array(col) and even wrote a function that takes an int value and returns a list, but neither worked.

from array import array  # Python's built-in array module, not pyspark.sql.functions.array
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import ArrayType

def to_array(x):
    return [x]

df = df.withColumn("num_of_items", monotonically_increasing_id())

Current output

col_1    | num_of_items
A        |  1
B        |  2

Expected output

col_1    | num_of_items
A        | [1]
B        | [2]

Upvotes: 3

Views: 8604

Answers (1)

pault

Reputation: 43494

I tried using array(col)

Using pyspark.sql.functions.array seems to work for me.

from pyspark.sql.functions import array
df.withColumn("num_of_items", array("num_of_items")).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#|    A|         [1]|
#|    B|         [2]|
#+-----+------------+
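
As a quick sanity check, you can also inspect the schema to confirm the column is now an array. (This is just a sketch; the exact element type and nullability flags depend on how df was originally created.)

from pyspark.sql.functions import array

# Confirm the column type changed from a numeric type to an array.
# Element type (long vs int) and nullability may differ in your DataFrame.
df.withColumn("num_of_items", array("num_of_items")).printSchema()
#root
# |-- col_1: string (nullable = true)
# |-- num_of_items: array (nullable = false)
# |    |-- element: long (containsNull = true)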

and even creating a function to return a list by taking int value as input.

If you want to use the function you created, you have to make it a udf and specify the return type:

from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql.functions import udf, col

# The function from the question, wrapped as a udf with an explicit return type
def to_array(x):
    return [x]

to_array_udf = udf(to_array, ArrayType(IntegerType()))
df.withColumn("num_of_items", to_array_udf(col("num_of_items"))).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#|    A|         [1]|
#|    B|         [2]|
#+-----+------------+

But it's preferable to avoid using udfs when possible; see Spark functions vs UDF performance?
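
To see the difference yourself, you can compare the physical plans of the two approaches. This is only a sketch: the exact plan text varies by Spark version, but the udf version typically adds an extra Python evaluation step, meaning rows are serialized out to a Python worker and back, while array stays entirely in the JVM.

from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import ArrayType, IntegerType

# Built-in function: the whole plan stays inside the JVM / Catalyst optimizer.
df.withColumn("num_of_items", array("num_of_items")).explain()

# Python udf: the plan usually shows an extra BatchEvalPython step.
# Exact plan text varies by Spark version.
to_array_udf = udf(lambda x: [x], ArrayType(IntegerType()))
df.withColumn("num_of_items", to_array_udf(col("num_of_items"))).explain()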

Upvotes: 7
