Reputation: 41
I have a PySpark dataframe that looks like this:
Date Sales Type
0 2020-01-01 10 hotdog
1 2020-01-01 5 icecream
2 2020-01-01 9 soda
3 2020-01-02 7 hotdog
4 2020-01-02 5 icecream
.. ... ... ...
89 2020-01-30 4 icecream
90 2020-01-30 11 soda
91 2020-01-31 7 hotdog
92 2020-01-31 3 icecream
93 2020-01-31 12 soda
I need to add a column that indicates whether the row's date is before or after 2020-01-15.
In pandas, I can do
df['Before-After'] = df['Date'] < '2020-01-15'
How do I do that with a PySpark dataframe?
Upvotes: 0
Views: 242
Reputation: 31470
Use a when + otherwise expression for this case.
Example:
from pyspark.sql.functions import col, when
# sample data with the Date column cast to a proper DateType
df=spark.createDataFrame([('2020-01-01',),('2020-01-02',),('2020-01-16',),('2020-01-31',)],['Date']).withColumn("Date",col("Date").cast("date"))
# flag rows whose Date falls before 2020-01-15
df.withColumn("Before-After",when(col("Date") < "2020-01-15",True).otherwise(False)).show()
# SQL equivalent using CASE WHEN
df.createOrReplaceTempView("tmp")
spark.sql("select Date, case when Date < '2020-01-15' then True else False end as `Before-After` from tmp").show()
#+----------+------------+
#| Date|Before-After|
#+----------+------------+
#|2020-01-01| true|
#|2020-01-02| true|
#|2020-01-16| false|
#|2020-01-31| false|
#+----------+------------+
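As an aside, the comparison itself already returns a boolean column, so the when/otherwise wrapper can be dropped if you want something closer to the pandas one-liner. A minimal sketch, assuming df has a DateType Date column as above:
from pyspark.sql.functions import col
# the comparison yields a boolean Column directly
df.withColumn("Before-After", col("Date") < "2020-01-15").show()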
Upvotes: 2