Gerard
Gerard

Reputation: 35

Finding the maximum sequence in a given range - Spark/Scala

I am practicing a problem with Spark using Scala and I have a football league table with columns (teamname, lastgame, nextgame, dateoflastgame, dateofnextgame)

I am trying to run a query on the data where I can determine the longest run a team will go without playing the same team twice. I. E. For a team (CHL, MNU, ARS, 01/01/2020,08/01/2020)

There are multiple entries for CHL across the season having all this data, I am trying to find the longest run they would have without coming across a certain team e. g. MCU.

For example a run would look like of next games, MCU-BHA-LIV-WHU-TOT-MCU would be a run of 4 games.

I have been using spark sql and thinking querying against nextgame where teamname='mnu' but I'm quite puzzled how I would count these values and return the maximum run of games without playing MCU. Furthermore, I would like to do this operation against all the teams in the league to get a list of the longest runs without playing MCU.

Any thoughts where I could start with this?

Upvotes: 2

Views: 458

Answers (1)

Kombajn zbożowy
Kombajn zbożowy

Reputation: 10703

Window functions to the rescue.

val df = Seq(
    ("CHL", "MCU", "2020-01-01"),
    ("CHL", "BHA", "2020-01-02"),
    ("CHL", "LIV", "2020-01-03"),
    ("CHL", "WHU", "2020-01-04"),
    ("CHL", "TOT", "2020-01-05"),
    ("CHL", "MCU", "2020-01-06")
).toDF("teamname", "nextgame", "dateofnextgame")

val w = Window.partitionBy($"teamname")

df.withColumn("seq", row_number().over(w.orderBy($"dateofnextgame"))).
  withColumn("max", count(lit(1)).over(w)).
  filter($"nextgame" === "MCU").
  withColumn("next_seq", coalesce(lead($"seq" - 1, 1).over(w.orderBy($"seq")), lit($"max"))).
  withColumn("streak", $"next_seq" - $"seq").
  groupBy($"teamname").
  agg(max($"streak").as("max_streak")).
  show

+--------+----------+
|teamname|max_streak|
+--------+----------+
|     CHL|         4|
+--------+----------+

In the end it's pretty simple, you just assign a sequence number to games and the longest run is maximum difference between seq of current and next game against MCU.

Upvotes: 2

Related Questions