Naveen Srikanth

Reputation: 789

Pyspark data transformation logic to assign one column values to another

I am using Spark 2.1.0 and have the dataframe shown below. I am very new to PySpark and am stuck on this issue.

The problem statement: taking column b into consideration, column c should be populated with reference to column a. For every 4 values in column a, column c has to be filled with the value of column b from an earlier row. For example, in the dataframe below, row 4's c value is 30; this 30 is taken from column b of the row whose a value is 1.

Below is my original dataframe

a   b
0   0
1   30
2   18
3   10
4   5
5   1
1   8
2   6
3   9
4   3
5   4
1   2
2   18
3   2
4   1
5   15

Resulting dataframe should be as shown below:

a   b   c
0   0   0
1   30  0
2   18  0
3   10  0
4   5   30
5   1   18
1   8   10
2   6   5
3   9   1
4   3   8
5   4   6
1   2   9
2   18  3
3   2   4
4   1   2
5   15  18

Please help me resolve this. Thanks in advance.

Upvotes: 0

Views: 159

Answers (1)

koiralo

Reputation: 23119

You can use the `lag` function to get the value from 3 rows earlier (in Scala), as below:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(
  (0, 0),
  (1, 30),
  (2, 18),
  (3, 10),
  (4, 5),
  (5, 1),
  (1, 8),
  (2, 6),
  (3, 9),
  (4, 3),
  (5, 4),
  (1, 2),
  (2, 18),
  (3, 2),
  (4, 1),
  (5, 15)
).toDF("a", "b")

df1.show()

df1.withColumn("id", monotonically_increasing_id())            // id fixes the row order
  .withColumn("c", lag($"b", 3, 0).over(Window.orderBy("id"))) // b from 3 rows back, default 0
  .drop("id")
  .show()

Output:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  0|  0|
|  1| 30|  0|
|  2| 18|  0|
|  3| 10|  0|
|  4|  5| 30|
|  5|  1| 18|
|  1|  8| 10|
|  2|  6|  5|
|  3|  9|  1|
|  4|  3|  8|
|  5|  4|  6|
|  1|  2|  9|
|  2| 18|  3|
|  3|  2|  4|
|  4|  1|  2|
|  5| 15| 18|
+---+---+---+

Hope this helps!

Upvotes: 1
