Patrick
Patrick

Reputation: 13

How to convert a String column to Date column for a PySpark dataframe using strptime function?

I want to convert a string column to a date column for a pyspark dataframe as follows :

|Date| +-------+--- |10-Nov-15|
|11-Oct-17|

I know strptime function would work, but unable to use it with PySpark dataframe.

Upvotes: 0

Views: 1196

Answers (2)

过过招
过过招

Reputation: 4234

You can use the to_date function.

import pyspark.sql.functions as F
......

df = df.withColumn('Date', F.to_date(F.lpad('Date', 9, '0'), 'dd-MMM-yy'))
df.printSchema()
df.show(truncate=False)

Upvotes: 1

Vaebhav
Vaebhav

Reputation: 5052

You can directly use to_date , which is natively present in Pyspark

The key is to find the appropriate parsing patterns.

The available DateTime Patterns for Parsing can be found - https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

Furthermore to handle edge cases , like - 2-Feb-20 and 02-Feb-20 , you can set the timeParserPolicy=LEGACY as the above link is applicable for Spark 3.x , more info about this can be found here

Data Preparation

d = {
        'date_col':['10-Oct-15', '11-Oct-17', '2-Feb-20', '02-Feb-20']
    }

sparkDF = sql.createDataFrame(pd.DataFrame(d))

sparkDF.show()
+---------+
| date_col|
+---------+
|10-Oct-15|
|11-Oct-17|
| 2-Feb-20|
|02-Feb-20|
+---------+

sparkDF.printSchema()

root
 |-- date_col: string (nullable = true)

To Date

sql.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

sparkDF = sparkDF.withColumn('date_col_parsed',F.to_date(F.col('date_col'),'dd-MMM-yy'))

+---------+---------------+
| date_col|date_col_parsed|
+---------+---------------+
|10-Oct-15|     2015-10-10|
|11-Oct-17|     2017-10-11|
| 2-Feb-20|     2020-02-02|
|02-Feb-20|     2020-02-02|
+---------+---------------+

sparkDF.printSchema()

root
 |-- date_col: string (nullable = true)
 |-- date_col_parsed: date (nullable = true)

To Date - SparkSQL

sparkDF.registerTempTable("TB1")

sql.sql(
    """
    SELECT
         date_col
        ,TO_DATE(date_col,'dd-MMM-yy') as date_col_parsed
    FROM TB1
    """
).show()

+---------+---------------+
| date_col|date_col_parsed|
+---------+---------------+
|10-Oct-15|     2015-10-10|
|11-Oct-17|     2017-10-11|
| 2-Feb-20|     2020-02-02|
|02-Feb-20|     2020-02-02|
+---------+---------------+

Upvotes: 0

Related Questions