Reputation: 23
A sample dataframe can be created using:
from pyspark.sql.functions import col
from pyspark.sql.window import Window
df = sc.parallelize([['2019-08-29 01:00:00',0],
['2019-08-29 02:00:00',0],
['2019-08-29 03:00:00',0],
['2019-08-29 04:00:00',1],
['2019-08-29 05:00:00',2],
['2019-08-29 06:00:00',3],
['2019-08-29 07:00:00',0],
['2019-08-29 08:00:00',2],
['2019-08-29 09:00:00',0],
['2019-08-29 10:00:00',1]]).toDF(['DATETIME','VAL']).withColumn('DATETIME',col('DATETIME').cast('timestamp'))
I want to produce a column containing the count of 0 values within a 3-hour period (+/- 1 hour of the current row's time, inclusive of the current VAL). The window can be created using:
w1 = (Window()
.orderBy(col('DATETIME').cast('long'))
.rangeBetween(-(60*60), 60*60))
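My rough idea was to apply a conditional count over this window, along these lines (just a sketch; I'm assuming F is pyspark.sql.functions and haven't confirmed this produces the NUM column I want):
from pyspark.sql import functions as F
# Count the rows within the +/- 1 hour frame whose VAL is 0
df.withColumn('NUM', F.count(F.when(F.col('VAL') == 0, 1)).over(w1)).show()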
Desired outcome:
+-------------------+---+---+
| DATETIME|VAL|NUM|
+-------------------+---+---+
|2019-08-29 01:00:00| 0| 2|
|2019-08-29 02:00:00| 0| 3|
|2019-08-29 03:00:00| 0| 2|
|2019-08-29 04:00:00| 1| 1|
|2019-08-29 05:00:00| 2| 0|
|2019-08-29 06:00:00| 3| 1|
|2019-08-29 07:00:00| 0| 1|
|2019-08-29 08:00:00| 2| 2|
|2019-08-29 09:00:00| 0| 1|
|2019-08-29 10:00:00| 1| 1|
+-------------------+---+---+
Upvotes: 0
Views: 145
Reputation: 5068
If you have only 1 entry per DATETIME, you can use the lag and lead functions to get the previous and next values, and then count the zeros.
from pyspark.sql.functions import udf, array, col, lag, lead
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

count_zeros_udf = udf(lambda arr: arr.count(0), IntegerType())

(df.withColumn('lag1', lag(col('VAL'), 1, -1).over(Window.orderBy('DATETIME')))    # Get the previous value
   .withColumn('lag2', lead(col('VAL'), 1, -1).over(Window.orderBy('DATETIME')))   # Get the next value
   .withColumn('NUM', count_zeros_udf(array('VAL', 'lag1', 'lag2')))               # Count zeros using the udf
   .drop('lag1', 'lag2')                                                            # Drop the extra columns
   .show())
+-------------------+---+---+
| DATETIME|VAL|NUM|
+-------------------+---+---+
|2019-08-29 01:00:00| 0| 2|
|2019-08-29 02:00:00| 0| 3|
|2019-08-29 03:00:00| 0| 2|
|2019-08-29 04:00:00| 1| 1|
|2019-08-29 05:00:00| 2| 0|
|2019-08-29 06:00:00| 3| 1|
|2019-08-29 07:00:00| 0| 1|
|2019-08-29 08:00:00| 2| 2|
|2019-08-29 09:00:00| 0| 1|
|2019-08-29 10:00:00| 1| 1|
+-------------------+---+---+
With pyspark >= 2.4, you can use a pandas UDF over a window, as explained here: User defined function to be applied to Window in PySpark? . Unfortunately, I don't have pyspark 2.4 or above, so I cannot test it.
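For reference, a minimal sketch of what that could look like (untested, as noted; it assumes a grouped-aggregate pandas UDF can be applied over your bounded w1 window, which may require a newer Spark version than 2.4):
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, col
from pyspark.sql.window import Window

# Grouped-aggregate pandas UDF that counts the zeros in the window frame
@pandas_udf('int', PandasUDFType.GROUPED_AGG)
def count_zeros(v):
    return int((v == 0).sum())

w1 = (Window
      .orderBy(col('DATETIME').cast('long'))
      .rangeBetween(-(60 * 60), 60 * 60))

df.withColumn('NUM', count_zeros(col('VAL')).over(w1)).show()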
Upvotes: 1