Reputation: 160
I have a Spark DataFrame. One of the columns has dates in a format like 2018-Jan-12.
I need to change this format to 20180112.
How can this be achieved?
Upvotes: 2
Views: 18896
Reputation: 43494
For Spark version 1.5+
Suppose you had the following DataFrame:
df = sqlCtx.createDataFrame([("2018-Jan-12",)], ["date_str"])
df.show()
#+-----------+
#|   date_str|
#+-----------+
#|2018-Jan-12|
#+-----------+
To avoid using udfs, you can first convert the string to a date:
from pyspark.sql.functions import from_unixtime, unix_timestamp
df = df.withColumn('date', from_unixtime(unix_timestamp('date_str', 'yyyy-MMM-dd')))
df.show()
#+-----------+-------------------+
#|   date_str|               date|
#+-----------+-------------------+
#|2018-Jan-12|2018-01-12 00:00:00|
#+-----------+-------------------+
Then format the date as a string in your desired format:
from pyspark.sql.functions import date_format, col
df = df.withColumn("new_date_str", date_format(col("date"), "yyyyMMdd"))
df.show()
#+-----------+-------------------+------------+
#|   date_str|               date|new_date_str|
#+-----------+-------------------+------------+
#|2018-Jan-12|2018-01-12 00:00:00|    20180112|
#+-----------+-------------------+------------+
Or if you prefer, you can chain it all together and skip the intermediate steps:
import pyspark.sql.functions as f
df.select(
    f.date_format(
        f.from_unixtime(
            f.unix_timestamp(
                'date_str',
                'yyyy-MMM-dd')
        ),
        "yyyyMMdd"
    ).alias("new_date_str")
).show()
#+------------+
#|new_date_str|
#+------------+
#|    20180112|
#+------------+
Upvotes: 5
Reputation: 10256
You can use a PySpark UDF.
from pyspark.sql import functions as f
from pyspark.sql import types as t
from datetime import datetime
# strptime is a classmethod of datetime.datetime, so import the class rather than the method
df = df.withColumn('date_col', f.udf(lambda d: datetime.strptime(d, '%Y-%b-%d').strftime('%Y%m%d'), t.StringType())(f.col('date_col')))
Or, you can define a larger function to catch exceptions if needed.
def date_converter(date_string):
    try:
        _date = datetime.strptime(date_string, '%Y-%b-%d')
        str_date = _date.strftime('%Y%m%d')
        return str_date
    except Exception:
        # Some code if needed
        return ''
udf_function = f.udf(date_converter, t.StringType())
df = df.withColumn('date_col', udf_function(df.date_col))
Note: I am assuming date_col is the name of your column.
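Because the conversion logic is plain Python, it can be checked without a Spark session before it is wrapped in a UDF. The sketch below is one way to do that. The name to_compact_date and the choice to return None for missing or malformed values (instead of an empty string) are assumptions for illustration, not part of the answer above.

```python
from datetime import datetime


def to_compact_date(date_string):
    """Convert '2018-Jan-12' to '20180112'; return None if parsing fails."""
    if date_string is None:
        # Null column values reach a Python UDF as None.
        return None
    try:
        return datetime.strptime(date_string, '%Y-%b-%d').strftime('%Y%m%d')
    except ValueError:
        # Raised when the string does not match the expected format.
        return None


print(to_compact_date('2018-Jan-12'))  # 20180112
print(to_compact_date('not a date'))   # None
```

Once it behaves as expected, it could be wrapped with f.udf and t.StringType() in the same way as the examples above.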
Upvotes: 4
Reputation: 823
Investigate Python's datetime library and its strftime() and strptime() methods; see "Basic date and time types: strftime() and strptime()" in the Python documentation.
For example, using strftime.org as a reference:
from datetime import datetime
date_string = '2018-Jan-12'
# Assuming day of the month is a zero-padded decimal number
datetime_object = datetime.strptime(date_string, '%Y-%b-%d')
converted_date_string = datetime_object.strftime('%Y%m%d')
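One caveat: %b matches the abbreviated month name of the current locale, so 'Jan' parses as expected only under an English (or the default C) locale. If that might not hold, a minimal sketch that avoids locale lookups by mapping the abbreviations by hand could look like this. MONTHS and convert are hypothetical names, and the code assumes English three-letter abbreviations.

```python
# English month abbreviations mapped to two-digit month numbers.
MONTHS = {
    name: f"{number:02d}"
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def convert(date_string):
    """Turn 'YYYY-Mon-DD' into 'YYYYMMDD' without locale-dependent parsing."""
    year, month, day = date_string.split("-")
    # zfill pads a single-digit day such as '5' to '05'.
    return f"{year}{MONTHS[month]}{day.zfill(2)}"


print(convert("2018-Jan-12"))  # 20180112
```

Unlike strptime, this sketch does not check that the date exists (it would accept 2018-Feb-31), so it only suits input that is already known to be valid.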
Upvotes: 1