Reputation: 789
I am using spark 2.1.0. I have dataframe as mentioned below. I am very new to pyspark I am stuck up with this issue
Now the problem statement is : Taking b column into consideration I should populate the C column from reference to column a i,e For every 4 values from column a , column c has to be populated with referring to values from column b. For example as shown in below dataframe from row no :4 equivalent c value is 30 . This 30 has been obtained from column b having its equivalent a to be 1
Below is my original dataframe
a b
0 0
1 30
2 18
3 10
4 5
5 1
1 8
2 6
3 9
4 3
5 4
1 2
2 18
3 2
4 1
5 15
Resulting dataframe should be as shown below:
a b c
0 0 0
1 30 0
2 18 0
3 10 0
4 5 30
5 1 18
1 8 10
2 6 5
3 9 1
4 3 8
5 4 6
1 2 9
2 18 3
3 2 4
4 1 2
5 15 18
Please help me in resolving this Thanks in advance
Upvotes: 0
Views: 159
Reputation: 23119
You can use the lag function to get the previous row as below
import spark.implicits._
val df1 = Seq (
(0, 0),
(1, 30),
(2, 18),
(3, 10),
(4, 5),
(5, 1),
(1, 8),
(2, 6),
(3, 9),
(4, 3),
(5, 4),
(1, 2),
(2, 18),
(3, 2),
(4, 1),
(5, 15)
).toDF("a", "b")
df1.show()
df1.withColumn("id", monotonically_increasing_id())
.withColumn("c", lag($"b", 3, 0).over(Window.orderBy("id")))
.drop("id")
.show()
Output:
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 0| 0|
| 1| 30| 0|
| 2| 18| 0|
| 3| 10| 0|
| 4| 5| 30|
| 5| 1| 18|
| 1| 8| 10|
| 2| 6| 5|
| 3| 9| 1|
| 4| 3| 8|
| 5| 4| 6|
| 1| 2| 9|
| 2| 18| 3|
| 3| 2| 4|
| 4| 1| 2|
| 5| 15| 18|
+---+---+---+
Hope this helps!
Upvotes: 1