Mauricio Ortiz

Reputation: 33

Compare Values of the Current and Previous Rows, Then the Next Column if Required, in Spark

I am trying to select the value of one column based on the values of other rows and other columns.

scala> val df = Seq((1,"051",0,0,10,0),(1,"052",0,0,0,0),(2,"053",10,0,10,0),(2,"054",0,0,10,0),(3,"055",100,50,0,0),(3,"056",100,10,0,0),(3,"057",100,20,0,0),(4,"058",70,15,0,0),(4,"059",70,15,0,20),(4,"060",70,15,0,0)).toDF("id","code","value_1","value_2","value_3","Value_4")
scala> df.show()
+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  1| 052|      0|      0|      0|      0|
|  2| 053|     10|      0|     10|      0|
|  2| 054|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  3| 056|    100|     10|      0|      0|
|  3| 057|    100|     20|      0|      0|
|  4| 058|     70|     15|      0|      0|
|  4| 059|     70|     15|      0|     20|
|  4| 060|     70|     15|      0|      0|
+---+----+-------+-------+-------+-------+

Calculation Logic:

Select one code for each id, following these steps:

  1. For each column value_n (value_1, value_2, value_3, value_4), do the following.
  2. Within the same id, look for the maximum value in the value_n column.
  3. If that maximum value is repeated, evaluate the next column.
  4. Otherwise, if the maximum value occurs only once, take the id and the code of the row holding that maximum. The remaining columns no longer need to be evaluated.

Expected Output:

+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  2| 053|     10|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  4| 059|     70|     15|      0|     20|
+---+----+-------+-------+-------+-------+

In the case of id 3: value_1 is 100 for all three rows, so the tie moves to value_2; its maximum, 50, appears only once, so code 055 is selected.
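To make the rule concrete, here is a minimal plain-Scala sketch of the selection logic on an in-memory collection (the Rec case class and pick function are just for illustration, not part of my Spark code); this is what I am trying to express with DataFrames:

case class Rec(id: Int, code: String, values: Seq[Int])

// Walk the value columns left to right; stop at the first column whose
// maximum is held by exactly one row, and return that row's code.
def pick(rows: Seq[Rec]): Option[String] = {
  val nCols = rows.head.values.length
  (0 until nCols).view.flatMap { i =>
    val mx      = rows.map(_.values(i)).max
    val winners = rows.filter(_.values(i) == mx)
    if (winners.size == 1) Some(winners.head.code) else None
  }.headOption
}

// id 3: value_1 ties at 100 for all rows, value_2 has a unique max of 50 -> "055"
pick(Seq(
  Rec(3, "055", Seq(100, 50, 0, 0)),
  Rec(3, "056", Seq(100, 10, 0, 0)),
  Rec(3, "057", Seq(100, 20, 0, 0))
))  // Some("055")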

Please help.

Upvotes: 1

Views: 738

Answers (2)

werner

Reputation: 14845

If the data is in a form where this algorithm is guaranteed to always select exactly one row per id, the following code produces the expected result:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, max}

val w = Window.partitionBy("id")

// Add per-id max and avg for every value column; max <> avg means the
// values in that column are not all equal within the id group.
var df2 = df
val cols = Seq("value_1", "value_2", "value_3", "value_4")
for (col <- cols) {
  df2 = df2.withColumn(s"${col}_max", max(col).over(w))
    .withColumn(s"${col}_avg", avg(col).over(w))
}

// Build a filter that keeps a row as soon as some column has a
// non-repeated maximum and the row holds that maximum.
var sel = ""
for (col <- cols) {
  sel += s"(${col}_max <> ${col}_avg and ${col} = ${col}_max) or "
}
sel = sel.dropRight(4)

df2.filter(sel).select("id", ("code" +: cols): _*).sort("id", "code").show()
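A slightly more compact way to build the same filter expression, assuming the same cols sequence as above, is to join the per-column predicates with mkString so there is no trailing "or" to trim off:

// hypothetical alternative to the var/dropRight construction above
val sel2 = cols
  .map(c => s"(${c}_max <> ${c}_avg and $c = ${c}_max)")
  .mkString(" or ")

df2.filter(sel2).select("id", ("code" +: cols): _*).sort("id", "code").show()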

Upvotes: 1

C.S.Reddy Gadipally

Reputation: 1758

You can put your value_1 to value_4 columns into a struct and call the max function on it over a window partitioned by the id column.


scala> df.show
+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  1| 052|      0|      0|      0|      0|
|  2| 053|     10|      0|     10|      0|
|  2| 054|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  3| 056|    100|     10|      0|      0|
|  3| 057|    100|     20|      0|      0|
|  4| 058|     70|     15|      0|      0|
|  4| 059|     70|     15|      0|     20|
|  4| 060|     70|     15|      0|      0|
+---+----+-------+-------+-------+-------+


scala> val dfWithVals = df.withColumn("values", struct($"value_1", $"value_2", $"value_3", $"value_4"))
dfWithVals: org.apache.spark.sql.DataFrame = [id: int, code: string ... 5 more fields]

scala> dfWithVals.show
+---+----+-------+-------+-------+-------+---------------+
| id|code|value_1|value_2|value_3|Value_4|         values|
+---+----+-------+-------+-------+-------+---------------+
|  1| 051|      0|      0|     10|      0|  [0, 0, 10, 0]|
|  1| 052|      0|      0|      0|      0|   [0, 0, 0, 0]|
|  2| 053|     10|      0|     10|      0| [10, 0, 10, 0]|
|  2| 054|      0|      0|     10|      0|  [0, 0, 10, 0]|
|  3| 055|    100|     50|      0|      0|[100, 50, 0, 0]|
|  3| 056|    100|     10|      0|      0|[100, 10, 0, 0]|
|  3| 057|    100|     20|      0|      0|[100, 20, 0, 0]|
|  4| 058|     70|     15|      0|      0| [70, 15, 0, 0]|
|  4| 059|     70|     15|      0|     20|[70, 15, 0, 20]|
|  4| 060|     70|     15|      0|      0| [70, 15, 0, 0]|
+---+----+-------+-------+-------+-------+---------------+


scala> val overColumns = org.apache.spark.sql.expressions.Window.partitionBy("id")
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@de0daca

scala> dfWithVals.withColumn("maxvals", max($"values").over(overColumns)).filter($"values" === $"maxvals").show
+---+----+-------+-------+-------+-------+---------------+---------------+      
| id|code|value_1|value_2|value_3|Value_4|         values|        maxvals|
+---+----+-------+-------+-------+-------+---------------+---------------+
|  1| 051|      0|      0|     10|      0|  [0, 0, 10, 0]|  [0, 0, 10, 0]|
|  3| 055|    100|     50|      0|      0|[100, 50, 0, 0]|[100, 50, 0, 0]|
|  4| 059|     70|     15|      0|     20|[70, 15, 0, 20]|[70, 15, 0, 20]|
|  2| 053|     10|      0|     10|      0| [10, 0, 10, 0]| [10, 0, 10, 0]|
+---+----+-------+-------+-------+-------+---------------+---------------+



scala> dfWithVals.withColumn("maxvals", max($"values").over(overColumns)).filter($"values" === $"maxvals").drop("values", "maxvals").show
+---+----+-------+-------+-------+-------+                                      
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  4| 059|     70|     15|      0|     20|
|  2| 053|     10|      0|     10|      0|
+---+----+-------+-------+-------+-------+
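This works because Spark compares struct values field by field, from left to right, so taking max over the struct reproduces the "value_1 first, then value_2, ..." tie-breaking. A quick self-contained check of that ordering (the a/b column names are only for illustration):

scala> Seq((100, 50), (100, 10), (100, 20)).toDF("a", "b").select(max(struct($"a", $"b"))).show
// the first field ties at 100, so the second field decides: expect [100, 50]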

Upvotes: 1
