Reputation: 13
I want to convert a string column to a date column for a pyspark dataframe as follows :
+---------+
|     Date|
+---------+
|10-Nov-15|
|11-Oct-17|
+---------+
I know the strptime function would work, but I am unable to use it with a PySpark DataFrame.
Upvotes: 0
Views: 1196
Reputation: 4234
You can use the to_date function. The lpad call zero-pads single-digit days (e.g. 2-Feb-20 becomes 02-Feb-20) so they match the fixed-width dd-MMM-yy pattern.
import pyspark.sql.functions as F
......  # df is created here with a string 'Date' column
# Zero-pad the string to 9 characters, then parse it as a date.
df = df.withColumn('Date', F.to_date(F.lpad('Date', 9, '0'), 'dd-MMM-yy'))
df.printSchema()
df.show(truncate=False)
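For context, here is a minimal end-to-end sketch; the SparkSession setup and sample rows are my own illustration, not part of the original answer:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data, including a single-digit day that needs the zero-padding.
df = spark.createDataFrame([('10-Nov-15',), ('11-Oct-17',), ('2-Feb-20',)], ['Date'])

# lpad turns '2-Feb-20' (8 chars) into '02-Feb-20', so the strict
# Spark 3 'dd-MMM-yy' pattern accepts every row.
df = df.withColumn('Date', F.to_date(F.lpad('Date', 9, '0'), 'dd-MMM-yy'))
df.printSchema()  # Date: date (nullable = true)
df.show()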
Upvotes: 1
Reputation: 5052
You can directly use to_date, which is natively available in PySpark.
The key is to find the appropriate parsing pattern.
The available datetime patterns for parsing are documented at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Furthermore, to handle edge cases such as 2-Feb-20 alongside 02-Feb-20, you can set spark.sql.legacy.timeParserPolicy=LEGACY, since the patterns in the link above apply to Spark 3.x.
import pandas as pd
import pyspark.sql.functions as F

# 'sql' here is the active SparkSession.
d = {
    'date_col': ['10-Oct-15', '11-Oct-17', '2-Feb-20', '02-Feb-20']
}

sparkDF = sql.createDataFrame(pd.DataFrame(d))
sparkDF.show()
+---------+
| date_col|
+---------+
|10-Oct-15|
|11-Oct-17|
| 2-Feb-20|
|02-Feb-20|
+---------+
sparkDF.printSchema()
root
|-- date_col: string (nullable = true)
sql.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
sparkDF = sparkDF.withColumn('date_col_parsed',F.to_date(F.col('date_col'),'dd-MMM-yy'))
+---------+---------------+
| date_col|date_col_parsed|
+---------+---------------+
|10-Oct-15| 2015-10-10|
|11-Oct-17| 2017-10-11|
| 2-Feb-20| 2020-02-02|
|02-Feb-20| 2020-02-02|
+---------+---------------+
sparkDF.printSchema()
root
|-- date_col: string (nullable = true)
|-- date_col_parsed: date (nullable = true)
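As a side note, the same setting can be applied through the runtime configuration API instead of a SQL statement; a minimal sketch, assuming sql is a SparkSession:
# Equivalent to: sql.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
sql.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')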
sparkDF.createOrReplaceTempView("TB1")  # registerTempTable is deprecated since Spark 2.0
sql.sql(
"""
SELECT
date_col
,TO_DATE(date_col,'dd-MMM-yy') as date_col_parsed
FROM TB1
"""
).show()
+---------+---------------+
| date_col|date_col_parsed|
+---------+---------------+
|10-Oct-15| 2015-10-10|
|11-Oct-17| 2017-10-11|
| 2-Feb-20| 2020-02-02|
|02-Feb-20| 2020-02-02|
+---------+---------------+
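Once the column is a real DateType, the usual date functions apply to it directly; a short illustrative sketch (the filter and year extraction below are my addition, not part of the answer):
# A DateType column supports comparisons and date functions natively.
sparkDF.filter(F.col('date_col_parsed') >= '2017-01-01') \
       .withColumn('year', F.year('date_col_parsed')) \
       .show()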
Upvotes: 0