Reputation: 19308
This question talks about how to chain custom PySpark 2 transformations.
The DataFrame#transform method was added to the PySpark 3 API.
The code snippet below shows a custom transformation that takes no arguments and works as expected, and another custom transformation that takes arguments and does not work.
from pyspark.sql.functions import col, lit

df = spark.createDataFrame([(1, 1.0), (2, 2.)], ["int", "float"])

def with_funny(word):
    def inner(df):
        return df.withColumn("funny", lit(word))
    return inner

def cast_all_to_int(input_df):
    return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])

df.transform(with_funny("bumfuzzle")).transform(cast_all_to_int).show()
Here's what's outputted:
+---+-----+-----+
|int|float|funny|
+---+-----+-----+
|  1|    1| null|
|  2|    2| null|
+---+-----+-----+
How should the with_funny() method be defined to output a value with the PySpark 3 API?
Upvotes: 6
Views: 4195
Reputation: 2718
This has been solved in PySpark 3.3.0, where DataFrame.transform accepts positional and keyword arguments to pass along to the transformation function:
def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":
    """Returns a new :class:`DataFrame`. Concise syntax for chaining custom transformations.

    .. versionadded:: 3.0.0

    Parameters
    ----------
    func : function
        a function that takes and returns a :class:`DataFrame`.
    *args
        Positional arguments to pass to func.

        .. versionadded:: 3.3.0
    **kwargs
        Keyword arguments to pass to func.

        .. versionadded:: 3.3.0

    Examples
    --------
    >>> from pyspark.sql.functions import col
    >>> df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["int", "float"])
    >>> def cast_all_to_int(input_df):
    ...     return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
    >>> def sort_columns_asc(input_df):
    ...     return input_df.select(*sorted(input_df.columns))
    >>> df.transform(cast_all_to_int).transform(sort_columns_asc).show()
    +-----+---+
    |float|int|
    +-----+---+
    |    1|  1|
    |    2|  2|
    +-----+---+

    >>> def add_n(input_df, n):
    ...     return input_df.select([(col(col_name) + n).alias(col_name)
    ...                             for col_name in input_df.columns])
    >>> df.transform(add_n, 1).transform(add_n, n=10).show()
    +---+-----+
    |int|float|
    +---+-----+
    | 12| 12.0|
    | 13| 13.0|
    +---+-----+
    """
    result = func(self, *args, **kwargs)
    assert isinstance(
        result, DataFrame
    ), "Func returned an instance of type [%s], " "should have been DataFrame." % type(result)
    return result
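Applied to the question's example, with_funny no longer needs to return an inner function: with PySpark >= 3.3.0 the DataFrame can be the first parameter and the word can be forwarded by transform() itself. A minimal sketch, reusing the df from the question (this with_funny is a hypothetical rewrite, not the original):

from pyspark.sql.functions import lit

def with_funny(df, word):
    # The DataFrame is now the first positional argument, so transform()
    # can forward the literal directly (requires PySpark >= 3.3.0).
    return df.withColumn("funny", lit(word))

df.transform(with_funny, "bumfuzzle").show()
df.transform(with_funny, word="bumfuzzle").show()  # keyword form also works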
Upvotes: 2
Reputation: 118
If I understood correctly, your first transformation adds a new column containing the literal string passed as an argument, and the last transformation casts all the columns to int, correct?
Since casting a string to int returns a null value, your final output is actually correct:
from pyspark.sql.functions import col, lit

df = spark.createDataFrame([(1, 1.0), (2, 2.)], ["int", "float"])

def with_funny(word):
    def inner(df):
        return df.withColumn("funny", lit(word))
    return inner

def cast_all_to_int(input_df):
    return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])

# first transform
df1 = df.transform(with_funny("bumfuzzle"))
df1.show()

# second transform
df2 = df1.transform(cast_all_to_int)
df2.show()

# all together
df_final = df.transform(with_funny("bumfuzzle")).transform(cast_all_to_int)
df_final.show()
Output:
+---+-----+---------+
|int|float|    funny|
+---+-----+---------+
|  1|  1.0|bumfuzzle|
|  2|  2.0|bumfuzzle|
+---+-----+---------+
+---+-----+-----+
|int|float|funny|
+---+-----+-----+
|  1|    1| null|
|  2|    2| null|
+---+-----+-----+
+---+-----+-----+
|int|float|funny|
+---+-----+-----+
|  1|    1| null|
|  2|    2| null|
+---+-----+-----+
Maybe what you want is to switch the order of your transformations, like this:
df_final = df.transform(cast_all_to_int).transform(with_funny("bumfuzzle"))
df_final.show()
Output:
+---+-----+---------+
|int|float|    funny|
+---+-----+---------+
|  1|    1|bumfuzzle|
|  2|    2|bumfuzzle|
+---+-----+---------+
Upvotes: 5