Reputation: 301
Suppose you have a table with snapshots of accounts from the past 4 years, 1 row per date/account pair. (Note that accounts don't need to have a row for every day.) For simplicity, suppose there are 3 columns: acct_number, date, and balance. How would you add 2 more columns to each row holding the date and balance that account had on its previous recorded date in the table?
Sorting by acct_number and date, then "offsetting by one" and joining back to the original table, seems like a promising approach, but I don't know how you'd implement this. Roughly, I have in mind something like the sketch below.
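A rough sketch of the self-join idea, assuming Spark (df stands for the snapshots DataFrame; names like numbered and prev are just illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// df: the snapshots DataFrame with columns acct_number, date, balance
val w = Window.partitionBy("acct_number").orderBy("date")
val numbered = df.withColumn("rn", row_number().over(w))

// Shift each row's index up by one so it lines up with the *next* row on join
val prev = numbered.select(
  col("acct_number"),
  (col("rn") + 1).as("rn"),
  col("date").as("prev_date"),
  col("balance").as("prev_balance"))

val joined = numbered.join(prev, Seq("acct_number", "rn"), "left").drop("rn")

Is there a simpler way than a self-join like this?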
Upvotes: 1
Views: 139
Reputation: 301
The solution is to use window functions with a lag of 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = load_my_df()

// Lag by 1 within each account, ordered by date; the first row per account gets null
val frame = Window.partitionBy("acct_number").orderBy("date")
val df_new = df
  .withColumn("prev_date", lag("date", 1, null).over(frame))
  .withColumn("prev_balance", lag("balance", 1, null).over(frame))
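A minimal check of what this produces, reusing frame and lag from above and assuming a local SparkSession with hand-built sample rows standing in for load_my_df():

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data: two snapshots for A1, one for A2
val sample = Seq(
  ("A1", "2021-01-01", 100.0),
  ("A1", "2021-01-05", 120.0),
  ("A2", "2021-01-02", 50.0)
).toDF("acct_number", "date", "balance")

sample
  .withColumn("prev_date", lag("date", 1, null).over(frame))
  .withColumn("prev_balance", lag("balance", 1, null).over(frame))
  .show()
// A1's 2021-01-05 row gets prev_date = 2021-01-01 and prev_balance = 100.0;
// each account's first row gets null in both new columns.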
Great examples here: https://sparkbyexamples.com/spark/spark-sql-window-functions/
Documentation: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Upvotes: 2