user3521180

Reputation: 1130

Why is the PySpark date_format function changing the data type of my columns?

I am in a tricky situation. I have two columns of string type that are supposed to be DateType in the final output. To achieve that, I am passing the target types as a Python dictionary:

d = {"col1": DateType(), "col2": DateType()}

The above dictionary is passed, together with the DataFrame, to the function that performs the cast:

df = func.cast_attribute(df2, p_property.d)

The col1 and col2 columns are in the "yyyy-MM-dd" format, but as per the business requirement I need them in the "dd-MM-yyyy" format. To do that, it was suggested that I use the PySpark built-in function date_format():

col_df = df.withColumn("col1", date_format("col1", format="dd-MM-yyyy"))

But the problem is that this transformation from yyyy-MM-dd to dd-MM-yyyy converts the data type of the two columns back to string.

Note: func.cast_attribute() is defined as follows:

from pyspark.sql.functions import col

def cast_attribute(df, mapper={}):
    """ Cast columns of the dataframe \n
        Usage: cast_attribute(your_df, {'column_name': StringType()})
    Args:
        df (DataFrame): DataFrame in which to convert some of the attributes.
        mapper (dict, optional): Dictionary mapping column names to their target data types.
    Returns:
        DataFrame: Returns the dataframe with converted columns
    """
    # Cast each mapped column to its target type, one withColumn at a time
    for column, dtype in mapper.items():
        df = df.withColumn(column, col(column).cast(dtype))
    return df
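
For illustration, here is a small reproduction of what I am seeing (the sample values are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format
from pyspark.sql.types import DateType

spark = SparkSession.builder.getOrCreate()

# Sample data in the same shape as my real columns
df2 = spark.createDataFrame([("2022-12-31", "2022-01-15")], ["col1", "col2"])

# Casting works as expected: both columns become DateType
df = cast_attribute(df2, {"col1": DateType(), "col2": DateType()})
df.printSchema()   # col1: date, col2: date

# Reformatting with date_format turns col1 back into a string
col_df = df.withColumn("col1", date_format("col1", "dd-MM-yyyy"))
col_df.printSchema()   # col1: string, col2: date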

Please suggest how to keep the date type in the final data frame.

Upvotes: 1

Views: 225

Answers (1)

Oli

Reputation: 10406

A date object does not have a format attached to it. It is just a date. The format only matters when you want to print that date somewhere (console, text file...). In that case, Spark needs to decide how to turn the date into a string before printing it, and by default it uses yyyy-MM-dd.
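
To illustrate (a minimal sketch; the actual value depends on the day you run it):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1).withColumn("date", F.current_date())
df.printSchema()   # the "date" column has type date
df.show()          # ...yet it is displayed in the default yyyy-MM-dd format, e.g. 2022-12-31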

Here is the documentation of date_format:

Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.

Therefore, if you want to keep the date type, do not use date_format. If your problem is how the date is serialized when it is written to a CSV or JSON file, you can configure the format with the dateFormat option of the DataFrameWriter.

df = spark.range(1).withColumn("date", F.current_date())
df.write.option("dateFormat", "dd-MM-yyyy").csv("data.csv")
> cat data.csv/*
0,31-12-2022

Upvotes: 1
