Powers

Reputation: 19308

PySpark cast ArrayType(ArrayType(NullType)) to ArrayType(ArrayType(IntegerType))

I have the following Spark DataFrame and would like to change the type of the nums column:

+---------+----------------------------+
|firstname|nums                        |
+---------+----------------------------+
|James    |[[null, null], [null, null]]|
|Michael  |[[null, null], [null, null]]|
+---------+----------------------------+

Here's the type of nums: StructField('nums', ArrayType(ArrayType(NullType(), True), True), True)

This is what I tried:

desired_type = StructField("nums", ArrayType(ArrayType(IntegerType(), True), True), True)
df = df.withColumn("nums", col("nums").cast(desired_type))

This is the error I got: IllegalArgumentException: Failed to convert the JSON string '{"metadata":{},"name":"nums","nullable":true,"type":{"containsNull":true,"elementType":{"containsNull":true,"elementType":"integer","type":"array"},"type":"array"}}' to a data type.

Here's the full example:

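from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import ArrayType, IntegerType, NullType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

data2 = [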
    ("James", [[None, None], [None, None]]),
    ("Michael", [[None, None], [None, None]]),
]

schema = StructType(
    [
        StructField("firstname", StringType(), True),
        StructField("nums", ArrayType(ArrayType(NullType(), True), True), True),
    ]
)

df = spark.createDataFrame(data=data2, schema=schema)

desired_type = StructField("nums", ArrayType(ArrayType(IntegerType(), True), True), True)
# This cast raises the IllegalArgumentException shown above.
df = df.withColumn("nums", col("nums").cast(desired_type))

Upvotes: 2

Views: 240

Answers (2)

boyangeor

Reputation: 1151

Pass the data type itself to cast, not a StructField. The desired_type should be created like this:

from pyspark.sql import functions as F

desired_type = ArrayType(ArrayType(IntegerType(), True), True)
df = df.withColumn("nums", F.col("nums").cast(desired_type))
df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- nums: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: integer (containsNull = true)

Upvotes: 2

notNull

Reputation: 31470

Try casting to the type string array<array<int>> instead of using a StructField:

Example:

df = df.withColumn("nums", col("nums").cast("array<array<int>>"))
print(df.schema)
df.printSchema()
#StructType([StructField('firstname', StringType(), True), StructField('nums', ArrayType(ArrayType(IntegerType(), True), True), True)])
#root
# |-- firstname: string (nullable = true)
# |-- nums: array (nullable = true)
# |    |-- element: array (containsNull = true)
# |    |    |-- element: integer (containsNull = true)

Upvotes: 1
