Michael Zeltser

Reputation: 399

How to carry over the last non-empty value to subsequent rows using a Spark DataFrame

I've a sparse dataset like this:

  ip,ts,session
  "123","1","s1"
  "123","2",""
  "123","3",""
  "123","4",""
  "123","10","s2"
  "123","11",""
  "123","12",""
  "222","5","s6"
  "222","6",""
  "222","7",""

I need to make it dense like this:

  ip,ts,session
  "123","1","s1"
  "123","2","s1"
  "123","3","s1"
  "123","4","s1"
  "123","10","s2"
  "123","11","s2"
  "123","12","s2"
  "222","5","s6"
  "222","6","s6"
  "222","7","s6"

I know how to do it with the RDD API: re-partition by ip and, within each partition, groupBy(ip).sortBy(ts).scan(). The scan function carries the previously computed value into the next iteration, decides whether to keep the prior value or the current one, and passes that choice on to the next scan step.
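
Roughly what I mean on the RDD side, just as a sketch (the tuple shape and names are illustrative):

// sparseRdd: RDD[(String, String, String)] holding (ip, ts, session)
val denseRdd = sparseRdd
  .groupBy { case (ip, _, _) => ip }                      // one group per ip
  .flatMap { case (_, rows) =>
    rows.toSeq
      .sortBy { case (_, ts, _) => ts.toInt }             // order the group's events by ts
      .scanLeft(("", "", "")) { case ((_, _, prev), (ip, ts, session)) =>
        (ip, ts, if (session.nonEmpty) session else prev) // carry the last non-empty session forward
      }
      .drop(1)                                            // drop the scanLeft seed
  }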

Now I'm trying to do it with DataFrames only, without dropping back to the RDD API. I was looking at Window functions, but all I could come up with is the first value within the group, which is not the same. Or I just do not understand how to define the correct range.
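
For reference, the dead end looked roughly like this (just a sketch, using dataWithSessionSparse as the starting DataFrame): it only repeats the first session value of each ip partition, instead of the last non-empty value seen so far.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

val w = Window.partitionBy(col("ip")).orderBy(col("ts"))
val attempt = dataWithSessionSparse.withColumn("session", first(col("session")).over(w))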

Upvotes: 1

Views: 205

Answers (2)

Michael Zeltser

Reputation: 399

My final code, reusing David Griffin's idea (dataWithSessionSparse is the starting dataset described in my question):

val denseSessRecordsOnly = dataWithSessionSparse
  .filter(col("sessionId") !== "")
  .select(col("ip").alias("r_ip"), col("sessionId").alias("r_sessionId"), col("ts").alias("r_ts")) // isolates the first record of every session

val dataWithSessionDense = dataWithSessionSparse
  .join(denseSessRecordsOnly, col("ip") === col("r_ip")) // explodes each event to relate to all sessions within its ip
  .filter(col("ts") >= col("r_ts")) // filters the exploded dataset so each event relates only to sessions that started at or before the event
  .groupBy(col("ip"), col("ts")).agg(max(col("r_ts")).alias("r_ts")) // keeps the latest such session start per event
  .join(
    denseSessRecordsOnly.select(col("r_ip").alias("l_ip"), col("r_sessionId").alias("sessionId"), col("r_ts").alias("l_ts")),
    col("r_ts") === col("l_ts") && col("ip") === col("l_ip")) // looks the sessionId back up by its session-start ts
  .select(col("ip"), col("ts"), col("sessionId"))

Upvotes: 0

David Griffin

Reputation: 13927

You can do it with multiple self-joins. Basically, you want to create a data set of all the "session start" records (filter($"session" !== "")) and then join that against the original data set, filtering out the records where the session start is later than the current record's ts (filter($"ts" >= $"r_ts")). Then you find the max($"r_ts") for each (ip, ts). The last join just retrieves the session value from the original data set.

data.join(
  data.filter($"session" !== "").select(    // session-start records only
    $"ip" as "r_ip", $"session" as "r_session", $"ts" as "r_ts"
  ),
  $"ip" === $"r_ip"
)
.filter($"ts" >= $"r_ts")                   // keep session starts at or before this record
.groupBy($"ip",$"ts")
.agg(max($"r_ts") as "r_ts")                // latest such session start per record
.join(
  data.select($"session",$"ts" as "l_ts"),  // look the session value back up by its start ts
  $"r_ts" === $"l_ts"
)
.select($"ip",$"ts",$"session")

BTW, my solution assumes that the column ts is something like a transaction sequence -- that it is an incrementing Int value. If it's not, you can use my DataFrame-ified zipWithIndex solution to create a column that will serve the same purpose.
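
A minimal sketch of what such a DataFrame-ified zipWithIndex can look like (the helper and column names here are illustrative): drop to the RDD just long enough to assign an increasing row index, then rebuild the DataFrame with the extra column.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def dfZipWithIndex(df: DataFrame, idxCol: String = "row_idx"): DataFrame = {
  // assign a stable, incrementing index to every row
  val rowsWithIdx = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ idx)
  }
  // rebuild the DataFrame with the index appended to the original schema
  df.sqlContext.createDataFrame(
    rowsWithIdx,
    StructType(df.schema.fields :+ StructField(idxCol, LongType, nullable = false))
  )
}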

Upvotes: 2
