Reputation: 349
I want to filter my data on a Datetime column in the format yyyy-mm-dd. However, the column holds string values, and there is a timestamp attached to the date. I don't want this timestamp in my column. I am using PySpark.
Current format of the date: 2021/09/23 09:00:00+00
Desired format: 2021-09-23
from pyspark.sql.functions import col, to_date
df = df_pyspark.withColumn("date_only", to_date(col("DateTime")))  # column name in the data is DateTime
The date_only column is showing null values. How should I approach this?
Upvotes: 0
Views: 1681
Reputation: 15318
When using the function to_date on a string that is not in the default yyyy-MM-dd layout, you need to pass a format string. The format string can be built from the datetime pattern reference in the Spark documentation (SimpleDateFormat patterns in Spark 2.x).
In your case, the format is yyyy/MM/dd HH:mm:ssX:
from pyspark.sql import functions as F
df.withColumn("t", F.to_date("datetime", "yyyy/MM/dd HH:mm:ssX")).show(truncate=False)
+----------------------+----------+
|DateTime |t |
+----------------------+----------+
|2021/09/23 09:00:00+00|2021-09-23|
+----------------------+----------+
You can then filter on the date:
df.where(F.to_date("datetime", "yyyy/MM/dd HH:mm:ssX") == "2021-09-23").show()
+--------------------+
| DateTime|
+--------------------+
|2021/09/23 09:00:...|
+--------------------+
Upvotes: 1