soupe
soupe

Reputation: 95

Spark: restart counting on specific value

I have a dataFrame with Boolean records and i want restart the counting when goal=False/Null.

How can i get the Score tab ? The score tab is a count of True values with a reset on False/null values

My df:

Goals
Null 
True
False 
True
True
True
True
False 
False 
True
True

Expected Result:

Goals   Score
Null    0
True    1
False   0
True    1
True    2
True    3
True    4
False   0
False   0
True    1
True    2

EDIT: Adding more infos

Actually my full dataset is:

Player   Goals         Date         Score
1       Null    2017-08-18 10:30:00 0
1       True    2017-08-18 11:30:00 1
1       False   2017-08-18 12:30:00 0
1       True    2017-08-18 13:30:00 1
1       True    2017-08-18 14:30:00 2
1       True    2017-08-18 15:30:00 3
1       True    2017-08-18 16:30:00 4
1       False   2017-08-18 17:30:00 0
1       False   2017-08-18 18:30:00 0
1       True    2017-08-18 19:30:00 1
1       True    2017-08-18 20:30:00 2
2       False   2017-08-18 10:30:00 0
2       False   2017-08-18 11:30:00 0
2       True    2017-08-18 12:30:00 1
2       True    2017-08-18 13:30:00 2
2       False   2017-08-18 15:30:00 0

I've created a window to calculate the score by player on a certain date

   val w = Window.partitionBy("Player","Goals").orderBy("date")

I've tried with the lag function and comparing the values but i can't reset the count.

EDIT2: Add unique Date per player

Thank you.

Upvotes: 0

Views: 1130

Answers (1)

soupe
soupe

Reputation: 95

I finally solved the problem with grouping the goals that occurs together.

I used a count over a partition containing the difference between the row index of the "table" and the row_number related to the partitioned window.

First declare the window with future columns to use

val w = Window.partitionBy("player","goals","countPartition").orderBy("date")

Then populate the columns "countPartition" and "goals" with 1 to keep the rowNumber neutral

  val list1=  dataList.withColumn("countPartition", lit(1)).withColumn("goals", lit(1)).withColumn("index", rowNumber over w )

the udf

def div = udf((countInit: Int, countP: Int) => countInit-countP)

And finally calculate the score

  val score = list1.withColumn("goals", goals).withColumn("countPartition", div(col("index") , rowNumber over w )).withColumn("Score", when(col("goals") === true, count("goals") over w ).otherwise(when(col("goals") isNull, "null").otherwise(0))).orderBy("date")

Upvotes: 1

Related Questions