Reputation: 1585
Suppose there is a dataframe with a column of dates stored as strings. Under that assumption, we create the following DataFrame as an example:
# Importing SQL types and functions
from pyspark.sql import SQLContext
from pyspark.sql.types import StringType, IntegerType, StructType, StructField, DoubleType, FloatType, DateType
from pyspark.sql.functions import date_format
import random
import time

def strTimeProp(start, end, format, prop):
    # Parse the boundary strings according to the given format and
    # interpolate linearly between the two timestamps
    stime = time.mktime(time.strptime(start, format))
    etime = time.mktime(time.strptime(end, format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(format, time.localtime(ptime))

def randomDate(start, end, prop):
    return strTimeProp(start, end, '%m-%d-%Y', prop)
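As a quick sanity check of the helpers above (redefined here so the snippet is self-contained and Spark-free), prop=0.0 should yield the start date and prop=1.0 the end date:

```python
import time

def strTimeProp(start, end, format, prop):
    # Interpolate between two date strings parsed with the given format
    stime = time.mktime(time.strptime(start, format))
    etime = time.mktime(time.strptime(end, format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(format, time.localtime(ptime))

def randomDate(start, end, prop):
    return strTimeProp(start, end, '%m-%d-%Y', prop)

print(randomDate("1-1-1991", "1-1-1992", 0.0))  # 01-01-1991
print(randomDate("1-1-1991", "1-1-1992", 1.0))  # 01-01-1992
```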
# Creating a test dataframe:
schema = StructType(
    [
        StructField("dates1", StringType(), True),
        StructField("dates2", StringType(), True)
    ]
)
size = 32
numCol1 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]
numCol2 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]

# Building the DataFrame:
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(list(zip(numCol1, numCol2)), schema=schema)
df.show(5)
The code above generates two columns of random dates; here is a sample of the output:
+----------+----------+
| dates1| dates2|
+----------+----------+
|12-21-1991|05-30-1991|
|05-28-1991|01-23-1991|
|03-01-1991|08-05-1991|
|07-15-1991|05-13-1991|
|07-21-1991|11-10-1991|
+----------+----------+
What I am trying to do is change the date format with the following code (from the pySpark documentation):
# Changing date formats:
df.select(date_format('dates1', 'MM-dd-yyy').alias('newFormat')).show(5)
But I get this bad result:
+---------+
|newFormat|
+---------+
| null|
| null|
| null|
| null|
| null|
+---------+
I suppose the problem is related to the string dataType, but at the same time I don't understand why the code below works while the code above doesn't.
fechas = ['1000-01-01', '1000-01-15']
df = sqlContext.createDataFrame(list(zip(fechas, fechas)), ['dates', 'd'])
df.show()
# Changing date formats:
df.select(date_format('dates', 'MM-dd-yyy').alias('newFormat')).show()
Output:
+----------+----------+
| dates| d|
+----------+----------+
|1000-01-01|1000-01-01|
|1000-01-15|1000-01-15|
+----------+----------+
+----------+
| newFormat|
+----------+
|01-01-1000|
|01-15-1000|
+----------+
This last result is what I want.
Upvotes: 4
Views: 20340
Reputation: 101
You can solve it by first casting your date from string to timestamp with the to_timestamp function available in pyspark.sql.functions (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.to_timestamp.html):
# Changing date formats:
from pyspark.sql.functions import to_timestamp, date_format
df.select(date_format(to_timestamp('dates1', "MM-dd-yyyy"), "MM-dd-yyy").alias('newFormat')).show(5)
+----------+
| newFormat|
+----------+
|08-11-1991|
|11-11-1991|
|12-04-1991|
|03-30-1991|
|04-22-1991|
+----------+
You can even extract, for example, only the month and day:
df.select(date_format(to_timestamp('dates1', "MM-dd-yyyy"), "MM-dd").alias('newFormat')).show(5)
+---------+
|newFormat|
+---------+
| 08-11|
| 11-11|
| 12-04|
| 03-30|
| 04-22|
+---------+
Upvotes: 0
Reputation: 330113
It doesn't work because your data is not a valid ISO 8601 representation, and casting it to date returns NULL:
sqlContext.sql("SELECT CAST('12-21-1991' AS DATE)").show()
## +----+
## | _c0|
## +----+
## |null|
## +----+
You'll have to parse the data first using a custom format:

from pyspark.sql.functions import date_format, unix_timestamp

output_format = ...  # Some SimpleDateFormat string

df.select(date_format(
    unix_timestamp("dates1", "MM-dd-yyyy").cast("timestamp"),
    output_format
))
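The same distinction can be illustrated with plain Python's datetime, as a rough analogy only (this is not Spark): an ISO 8601 parser rejects "12-21-1991", while an explicit format string parses it, after which any output format can be produced:

```python
from datetime import datetime

# ISO 8601 parsing rejects the MM-dd-yyyy string...
try:
    datetime.fromisoformat("12-21-1991")
    iso_ok = True
except ValueError:
    iso_ok = False

# ...but an explicit format parses it fine
parsed = datetime.strptime("12-21-1991", "%m-%d-%Y")
print(iso_ok)                        # False
print(parsed.strftime("%Y-%m-%d"))   # 1991-12-21
```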
Upvotes: 12