Reputation: 383
I have the below DF with me
+------+------+----+
| Year| PY| VAL|
+------+------+----+
|202005|201905|2005|
|202006|201906|2006|
|202007|201907|2007|
|201905|201805|1905|
|201906|201806|1906|
|201907|201807|1907|
|201805|201705|1805|
|201806|201706|1806|
|201807|201707|1807|
+------+------+----+
obtained by
val df1=Seq(
("202005","201905","2005"),
("202006","201906","2006"),
("202007","201907","2007"),
("201905","201805","1905"),
("201906","201806","1906"),
("201907","201807","1907"),
("201805","201705","1805"),
("201806","201706","1806"),
("201807","201707","1807")
)toDF("Year","PY","VAL")
I would like to populate the Value of previous year(VAL_PY) in a separate column. That value actually resides in a different row in the same DF.
Also , I would like to achieve this in a distributed way as my DF is a big one (> 10 million records)
Expected output --
+------+------+----+-------+
| Year| PY| VAL| VAL_PY|
+------+------+----+-------+
|202005|201905|2005|1905 |
|202006|201906|2006|1906 |
|202007|201907|2007|1907 |
|201905|201805|1905|1805 |
|201906|201806|1906|1806 |
|201907|201807|1907|1807 |
|201805|201705|1805|null |
|201806|201706|1806|null |
|201807|201707|1807|null |
+------+------+----+-------+
Upvotes: 0
Views: 164
Reputation: 2091
val df1=Seq(("202005","201905","2005"),("202006","201906","2006"),("202007","201907","2007"),("201905","201805","1905"),("201906","201806","1906"),("201907","201807","1907"),("201805","201705","1805"),("201806","201706","1806"),("201807","201707","1807")
)toDF("Year","PY","VAL")
val df2 = df1
.drop("PY")
.withColumnRenamed("VAL","VAL_PY")
.withColumnRenamed("Year","PY")
df1.join(df2, Seq("PY"),"left")
.select("Year","PY","VAL","VAL_PY").show
OUTPUT :
+------+------+----+------+
| Year| PY| VAL|VAL_PY|
+------+------+----+------+
|202005|201905|2005| 1905|
|202006|201906|2006| 1906|
|202007|201907|2007| 1907|
|201905|201805|1905| 1805|
|201906|201806|1906| 1806|
|201907|201807|1907| 1807|
|201805|201705|1805| null|
|201806|201706|1806| null|
|201807|201707|1807| null|
+------+------+----+------+
Seemed like a left self join. Please let me know if I am missing something.
Upvotes: 2