Reputation: 1097
I've seen (here: How to convert Timestamp to Date format in DataFrame?) how to convert a timestamp to a date type, but, at least for me, it doesn't work.
Here is what I've tried:
import pyspark.sql.functions as func
import pyspark.sql.types as stypes

# Create dataframe
df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date',])
# Convert to timestamp
df_test2 = df_test.withColumn('timestamp', func.when((df_test.date.isNull() | (df_test.date == '')), '0')\
.otherwise(func.unix_timestamp(df_test.date, 'yyyyMMdd')))
# Convert timestamp to date again
df_test2.withColumn('date_again', df_test2['timestamp'].cast(stypes.DateType())).show()
But this returns null in the column date_again:
+--------+----------+----------+
| date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502229600| null|
|20171007|1507327200| null|
+--------+----------+----------+
Any idea of what's failing?
Upvotes: 27
Views: 133211
Reputation: 3
You can directly cast the field:
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))
Upvotes: 0
Reputation: 24478
An option without importing TimestampType:
import pyspark.sql.functions as F
F.from_unixtime(F.col('date_col') / 1000).cast('date')
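The division by 1000 assumes the raw column holds epoch milliseconds. The arithmetic can be sanity-checked with plain Python's datetime module, independent of Spark (a sketch, using the question's first timestamp scaled to milliseconds; the result is pinned to UTC so it is deterministic):

```python
import datetime

# the question's first timestamp, scaled to milliseconds for illustration
ms = 1502229600000

# dividing by 1000 yields epoch seconds; pin to UTC for a deterministic result
dt = datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)
print(dt)  # 2017-08-08 22:00:00+00:00
```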
Upvotes: 0
Reputation: 5700
They closed my question as a duplicate of this one, so I'll copy and paste my answer here (it is a duplicate, right?)
As the timestamp column is in milliseconds, it is only necessary to convert it to seconds and cast it to TimestampType, and that should do the trick:
from pyspark.sql.types import TimestampType
import pyspark.sql.functions as F
df.select(
    (F.col("my_timestamp") / 1000).cast(TimestampType())
)
Upvotes: 1
Reputation: 272
import datetime
from pyspark.sql.functions import udf

# udf to convert the ts (in milliseconds) to a formatted timestamp string
get_timestamp = udf(lambda x: datetime.datetime.fromtimestamp(x / 1000.0).strftime("%Y-%m-%d %H:%M:%S"))

# apply this udf to the dataframe column holding your timestamp
df_withdate = df.withColumn("datetime", get_timestamp(df.ts))
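The lambda inside the udf is plain Python, so its conversion logic can be checked outside Spark. One caveat: fromtimestamp without a timezone uses the machine's local zone, so a timezone-pinned variant is sketched here to keep the output deterministic (the function name and sample value are illustrative, not from the answer):

```python
import datetime

def ts_to_string(x, tz=datetime.timezone.utc):
    # same logic as the udf body: milliseconds -> seconds -> formatted string
    return datetime.datetime.fromtimestamp(x / 1000.0, tz=tz).strftime("%Y-%m-%d %H:%M:%S")

print(ts_to_string(1502229600000))  # 2017-08-08 22:00:00
```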
Upvotes: 2
Reputation: 5075
To convert a unix_timestamp column (called TIMESTMP) in a pyspark dataframe (df) to a Date type:
Below is a two step process (there may be a shorter way):
1. convert the long to timestamp
2. convert the timestamp to Date
Initially the df.printSchema() shows: -- TIMESTMP: long (nullable = true)
Use spark.sql to implement the conversion as follows:
df.registerTempTable("dfTbl")

dfNew = spark.sql("""
    SELECT *, cast(TIMESTMP as Timestamp) as newTIMESTMP
    FROM dfTbl d
""")
dfNew.printSchema()
The printSchema() will show:
-- newTIMESTMP: timestamp (nullable = true)
Finally, convert the type from timestamp to Date as follows:
from pyspark.sql.types import DateType
dfNew=dfNew.withColumn('actual_date', dfNew['newTIMESTMP'].cast(DateType()))
Upvotes: 3
Reputation: 5075
For pyspark:
Assume you have a field named 'DateTime' that shows the date as a date and a time.
Add a new field to your df that shows a 'DateOnly' column as follows:
from pyspark.sql.functions import date_format
df.withColumn("DateOnly", date_format('DateTime', "yyyyMMdd")).show()
This will show a new column in the df called DateOnly, with the date in yyyyMMdd form.
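Spark's date_format takes Java SimpleDateFormat patterns, and yyyyMMdd corresponds to %Y%m%d in Python's strftime syntax. A quick stdlib check of the equivalent formatting (the date value is hypothetical):

```python
import datetime

# "yyyyMMdd" in Java pattern syntax is "%Y%m%d" in strftime syntax
d = datetime.date(2017, 8, 9)
print(d.strftime("%Y%m%d"))  # 20170809
```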
Upvotes: 8
Reputation: 35249
Following:
func.when((df_test.date.isNull() | (df_test.date == '')) , '0')\
.otherwise(func.unix_timestamp(df_test.date,'yyyyMMdd'))
doesn't work because it is type inconsistent - the first clause returns string while the second clause returns bigint. As a result it will always return NULL if data is NOT NULL and not empty.
It is also obsolete - SQL functions are NULL and malformed-format safe, so there is no need for additional checks.
In [1]: spark.sql("SELECT unix_timestamp(NULL, 'yyyyMMdd')").show()
+----------------------------------------------+
|unix_timestamp(CAST(NULL AS STRING), yyyyMMdd)|
+----------------------------------------------+
| null|
+----------------------------------------------+
In [2]: spark.sql("SELECT unix_timestamp('', 'yyyyMMdd')").show()
+--------------------------+
|unix_timestamp(, yyyyMMdd)|
+--------------------------+
| null|
+--------------------------+
And you don't need an intermediate step in Spark 2.2 or later:
from pyspark.sql.functions import to_date
to_date("date", "yyyyMMdd")
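to_date parses the string with the same yyyyMMdd pattern, skipping the unix-seconds intermediate entirely. The parsing direction can be mirrored in plain Python with strptime (a Spark-free sketch, using a value from the question's data):

```python
import datetime

# parse '20170809' the way to_date("date", "yyyyMMdd") would
parsed = datetime.datetime.strptime("20170809", "%Y%m%d").date()
print(parsed)  # 2017-08-09
```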
Upvotes: 26
Reputation: 41987
You should be doing the following:
>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).show()
+--------+----------+----------+
| date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502216100|2017-08-09|
|20171007|1507313700|2017-10-07|
+--------+----------+----------+
and schema is
>>> df_test2.withColumn('date_again', func.from_unixtime('timestamp').cast(DateType())).printSchema()
root
|-- date: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- date_again: date (nullable = true)
Upvotes: 18