Maarten
Maarten

Reputation: 239

How to convert the type of a column from String to Date

I have a pretty simple question where I couldn't find a simple answer for: I want to convert the type of a column in my Pyspark DataFrame from a String to a Date, how can I do this?

I tried the following:

df.withColumn('dates', datetime.strpdate(col('date'), %Y%m%d))

and

df.withColumn('dates', datetime.strpdate(df.date, %Y%m%d))

but in each case I get the following error: TypeError: strptime() argument 1 must be str, not Column.

So obviously the col('date') and df.date are interpreted as a Column rather than the string that is holds. How can I fix this?

Upvotes: 1

Views: 1806

Answers (1)

Neeraj Bhadani
Neeraj Bhadani

Reputation: 3110

You can convert the string column to date using "cast" function if the format is "yyyy-MM-dd" or you can use "to_date" function which is more generalized function where you can specify the input format as well.

Here is the sample code.

Code

# Create DaraFrame
df = spark.createDataFrame([(1, "2020-06-03", "2020/06/03"), (2, "2020-05-01", "2020/05/01")] , ["id", "date_fmt_1", "date_fmt_2"])

# Convert the string columne to date.
df1 = (df
         # approach - 1: use cast function if the format is "yyyy-MM-dd"
         .withColumn("date_1", df["date_fmt_1"].cast("date"))
       # Approach - 2 : use to_date function and specify the input format. "yyyy/MM/dd" in our case 
       .withColumn("date_2", to_date("date_fmt_2", "yyyy/MM/dd"))
       # If you don't specify any format, it will take spark default format "yyyy-MM-dd"
        .withColumn("date_3", to_date(df["date_fmt_1"])))

# Print the schema
df1.printSchema()

Schema Output

root
 |-- id: long (nullable = true)
 |-- date_fmt_1: string (nullable = true)
 |-- date_fmt_2: string (nullable = true)
 |-- date_1: date (nullable = true)
 |-- date_2: date (nullable = true)
 |-- date_3: date (nullable = true)

Display data.

df1.show()

DataFrame Output

+---+----------+----------+----------+----------+----------+
| id|date_fmt_1|date_fmt_2|    date_1|    date_2|    date_3|
+---+----------+----------+----------+----------+----------+
|  1|2020-06-03|2020/06/03|2020-06-03|2020-06-03|2020-06-03|
|  2|2020-05-01|2020/05/01|2020-05-01|2020-05-01|2020-05-01|
+---+----------+----------+----------+----------+----------+

For more information above Spark's DateTime function you can visit this blog:https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-datetime-functions-b66de737950a

I hope this helps.

Upvotes: 5

Related Questions