Reputation: 1130
I am in a tricky situation.
I have two columns of string type that are supposed to be DateType in the final output. To achieve that, I am passing a Python dictionary:
d = {"col1": DateType(), "col2": DateType()}
The dictionary is passed to a function that applies the casts to the data frame:
df = func.cast_attribute(df2, p_property.d)
col1 and col2 are in the "yyyy-MM-dd" format, but as per the business requirement I need the "dd-MM-yyyy" format. To do that, I was advised to use the PySpark built-in function date_format():
col_df = df.withColumn("col1", date_format("col1", format="dd-MM-yyyy"))
But the problem is that the above transformation from yyyy-MM-dd to dd-MM-yyyy converts the data type of the two columns back to string.
For reference, here is func.cast_attribute():
def cast_attribute(df, mapper={}):
    """Cast columns of the dataframe.

    Usage: cast_attribute(your_df, {'column_name': StringType()})

    Args:
        df (DataFrame): DataFrame whose attributes should be converted.
        mapper (dict, optional): Mapping from column name to new datatype.

    Returns:
        DataFrame: The dataframe with converted columns.
    """
    for column, dtype in mapper.items():
        df = df.withColumn(column, col(column).cast(dtype))
    return df
Please suggest how to keep the date type in the final data frame.
Upvotes: 1
Views: 225
Reputation: 10406
A date object does not have a format attached to it. It is just a date. The format only makes sense when you want to print that date somewhere (console, text file...). In that case, Spark needs to decide how to transform that date into a string before printing it. By default, Spark uses its default format yyyy-MM-dd.
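The same distinction exists in plain Python: a date object carries no format, and a format only appears once you turn the date into a string. A quick sketch with the standard datetime module (not Spark):

```python
from datetime import date, datetime

d = date(2022, 12, 31)        # just year/month/day, no format attached
s = d.strftime("%d-%m-%Y")    # formatting produces a string
parsed = datetime.strptime(s, "%d-%m-%Y").date()

# Round-tripping through the formatted string recovers the same date
assert parsed == d
```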
Here is the documentation of date_format:
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
Therefore, if you want to keep the date type, do not use date_format. If your problem is how the date is serialized when written to a CSV or a JSON file, you can configure the format with the dateFormat option of the DataFrameWriter.
import pyspark.sql.functions as F

df = spark.range(1).withColumn("date", F.current_date())
df.write.option("dateFormat", "dd-MM-yyyy").csv("data.csv")
> cat data.csv/*
0,31-12-2022
Upvotes: 1