Flika205

Reputation: 562

Changing column data types in a DataFrame and passing it into a UDF - PySpark

I'm currently working on a problem that involves changing the types of several columns in a DataFrame, but I'm not sure how to pass my function into a udf, because it takes a dictionary as an argument.

All the columns are currently of type String, but as I mentioned, I need to change them to different types such as Integer and Date.

My function looks something like this:

def columns_types_transformer(df, reformating_dict):
    # Cast each column named in the dict to its new Spark data type
    for column, new_type in reformating_dict.items():
        df = df.withColumn(column, df[column].cast(new_type))
    return df

The dictionary I want to pass looks like this:

from pyspark.sql.types import DateType, IntegerType

dictionary = {'date1': DateType(), 'date2': DateType(), 'date3': DateType(),
              'date4': DateType(), 'date5': DateType(), 'date6': DateType(),
              'integer1': IntegerType()}
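For context, this is how the function is called directly on a DataFrame (a minimal sketch with toy string data instead of my real columns, assuming an active SparkSession named spark):

# Toy example: both columns start out as strings
df = spark.createDataFrame(
    [('2020-01-01', '42')],
    ['date1', 'integer1'],
)

df = columns_types_transformer(df, {'date1': DateType(), 'integer1': IntegerType()})
df.printSchema()
# root
#  |-- date1: date (nullable = true)
#  |-- integer1: integer (nullable = true)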

My issue here is how to pass the dictionary with the correct types into a udf. Another approach I was thinking of is using SQLTransformer, but I'm also not sure how that can be done.

Any help would be appreciated.

Upvotes: 0

Views: 769

Answers (1)

Flika205

Reputation: 562

I managed to solve this issue using the SQLTransformer.

This is what I did:

from pyspark.ml.feature import SQLTransformer

sqlTrans_formatter = SQLTransformer(statement="""
    SELECT CAST(date1 AS date), CAST(date2 AS date), CAST(date3 AS date),
           CAST(date4 AS date), CAST(date5 AS date), CAST(date6 AS date),
           CAST(integer1 AS int)
    FROM __THIS__
""")

df = sqlTrans_formatter.transform(df)
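If the column list grows, the statement can also be generated from a dictionary instead of being written by hand. A minimal sketch, assuming the target types are given as Spark SQL type names (strings) rather than DataType objects:

from pyspark.ml.feature import SQLTransformer

# Hypothetical mapping from column name to Spark SQL type name
type_map = {'date1': 'date', 'date2': 'date', 'date3': 'date',
            'date4': 'date', 'date5': 'date', 'date6': 'date',
            'integer1': 'int'}

# Build "CAST(col AS type) AS col" for every entry; the explicit alias
# keeps the original column names in the result
casts = ', '.join(f'CAST({col} AS {sql_type}) AS {col}'
                  for col, sql_type in type_map.items())

sqlTrans_formatter = SQLTransformer(statement=f'SELECT {casts} FROM __THIS__')
df = sqlTrans_formatter.transform(df)

Note that either way, the SELECT only keeps the listed columns; any other columns in the DataFrame are dropped from the result.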

Hopefully it will be helpful for others as well.

Upvotes: 1
