Reputation: 35
I am practicing a problem with Spark using Scala and I have a football league table with columns (teamname, lastgame, nextgame, dateoflastgame, dateofnextgame
)
I am trying to run a query on the data where I can determine the longest run a team will go without playing the same team twice. I. E. For a team (CHL, MNU, ARS, 01/01/2020,08/01/2020
)
There are multiple entries for CHL across the season having all this data, I am trying to find the longest run they would have without coming across a certain team e. g. MCU.
For example a run would look like of next games, MCU-BHA-LIV-WHU-TOT-MCU
would be a run of 4 games.
I have been using spark sql and thinking querying against nextgame where teamname='mnu'
but I'm quite puzzled how I would count these values and return the maximum run of games without playing MCU. Furthermore, I would like to do this operation against all the teams in the league to get a list of the longest runs without playing MCU.
Any thoughts where I could start with this?
Upvotes: 2
Views: 458
Reputation: 10703
Window functions to the rescue.
val df = Seq(
("CHL", "MCU", "2020-01-01"),
("CHL", "BHA", "2020-01-02"),
("CHL", "LIV", "2020-01-03"),
("CHL", "WHU", "2020-01-04"),
("CHL", "TOT", "2020-01-05"),
("CHL", "MCU", "2020-01-06")
).toDF("teamname", "nextgame", "dateofnextgame")
val w = Window.partitionBy($"teamname")
df.withColumn("seq", row_number().over(w.orderBy($"dateofnextgame"))).
withColumn("max", count(lit(1)).over(w)).
filter($"nextgame" === "MCU").
withColumn("next_seq", coalesce(lead($"seq" - 1, 1).over(w.orderBy($"seq")), lit($"max"))).
withColumn("streak", $"next_seq" - $"seq").
groupBy($"teamname").
agg(max($"streak").as("max_streak")).
show
+--------+----------+
|teamname|max_streak|
+--------+----------+
| CHL| 4|
+--------+----------+
In the end it's pretty simple, you just assign a sequence number to games and the longest run is maximum difference between seq of current and next game against MCU.
Upvotes: 2