tusworten

Reputation: 21

Group by and aggregate tuples in Spark SQL

I'm working with Spark SQL and Java. I have a dataset with duplicate clients, grouped by ENTITY and DOCUMENT_ID. I added a ROWNUMBER column so I know how many clients there are in each group (and which rows I have to compare):

.withColumn("ROWNUMBER", row_number().over(Window.partitionBy("ENTITY", "ENTITY_DOC").orderBy("ID")))
+---------+----------+-----------------+----------+-----------+-----------+--------+------------+
|ROWNUMBER|  ENTITY  |      ENTITY_DOC |    ID    |  BLOCK    |  TYPE_DOC |COD_BEST|COD_CRITERIO|
+---------+----------+-----------------+----------+-----------+-----------+--------+------------+
|        1|       182|000004693R       |   5254578|          3|         01|       0|           0|
|        2|       182|000004693R       |  99841470|          0|         01|       0|           0|
|        3|       182|000004693R       |  45866239|          3|         01|       0|           0|
|        1|       182|000081638B       |  99804050|          0|         01|       0|           0|
|        2|       182|000081638B       |  99803968|          0|         01|       0|           0|
|        3|       182|000081638B       |  99803958|          0|         01|       0|           0|
|        4|       182|000081638B       |  99804054|          0|         01|       0|           0|
|        5|       182|000081638B       |  99787706|          1|         01|       0|           0|
|        6|       182|000081638B       |  99803930|          0|         01|       0|           0|
|        1|       182|000107084L       |  99819126|          0|         01|       0|           0|
|        2|       182|000107084L       |  99818446|          0|         01|       0|           0|
+---------+----------+-----------------+----------+-----------+-----------+--------+------------+

Now I have to compare pairs of rows in order to decide which is the best.

First compare row number 1 vs row number 2; whichever wins (say row number 2) is then compared against row number 3; if row number 3 wins, it is compared against row number 4, and so on.

The best row is decided by a set of business criteria, for example:

//criteria 1
BLOCK = 1 VS BLOCK = 1
   //tie: go to the next criteria

//criteria 2
BLOCK = 2 VS BLOCK = 1
   //the best is the row with BLOCK = 2

//criteria 3
TYPE_DOC = 1 VS TYPE_DOC = 1
   //tie: go to the next criteria

//criteria 4
TYPE_DOC = 1 VS TYPE_DOC = 2
   //the best is the row with TYPE_DOC = 1

(not a rigorous example, just to give an idea)
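
To make it concrete, the kind of pairwise check I mean would look roughly like this in plain Java. BLOCK and TYPE_DOC are real columns from my table, but the two rules shown are simplified stand-ins for the actual business criteria:

// Simplified stand-in for one row of a group.
class Candidate {
    int block;
    String typeDoc;
    Candidate(int block, String typeDoc) {
        this.block = block;
        this.typeDoc = typeDoc;
    }
}

// Walk the criteria in order; the first criterion where the rows differ decides.
static Candidate best(Candidate a, Candidate b) {
    // criteria 1/2: if BLOCK differs, the higher BLOCK wins (illustrative rule)
    if (a.block != b.block) {
        return a.block > b.block ? a : b;
    }
    // criteria 3/4: if TYPE_DOC differs, the lower TYPE_DOC wins (illustrative rule)
    if (!a.typeDoc.equals(b.typeDoc)) {
        return a.typeDoc.compareTo(b.typeDoc) < 0 ? a : b;
    }
    return a; // complete tie: keep the first row
}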


In the end I need to know which row is the best in each group and by which criterion it was selected, but I don't know how to iterate over each group to compare the fields of its rows.

Would it be very difficult to do?

Upvotes: 1

Views: 456

Answers (1)

blackbishop

Reputation: 32690

You can first assign a row_number to each duplicate, then build a map from the columns, adding the row number as a suffix to each key. Finally, group by ENTITY and DOCUMENT_ID, collect the list of maps, concatenate them, and pivot after exploding the map.

Note that I'm mainly using SQL expressions here, as I'm not very familiar with the Spark Java API, but the logic is the same if you want to convert them to API functions.

import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;

Dataset<Row> tuples = duplicates.withColumn(
    "rn",
    row_number().over(Window.partitionBy("ENTITY", "DOCUMENT_ID").orderBy("ID"))
).withColumn(
    "dupes",
    expr("map(concat('COUNTRY_', rn), COUNTRY, concat('ID_', rn), ID, concat('CUSTOMER_NAME_', rn), CUSTOMER_NAME)")
).groupBy("ENTITY", "DOCUMENT_ID").agg(
    collect_list("dupes").alias("dupes")
).selectExpr(
    "ENTITY",
    "DOCUMENT_ID",
    "explode(aggregate(dupes, cast(map() as map<string,string>), (acc, x) -> map_concat(acc, x)))"
).groupBy(
    "ENTITY", "DOCUMENT_ID"
).pivot("key").agg(first("value"));


tuples.show();

//+------+-----------+---------+---------+---------------+---------------+--------+--------+
//|ENTITY|DOCUMENT_ID|COUNTRY_1|COUNTRY_2|CUSTOMER_NAME_1|CUSTOMER_NAME_2|    ID_1|    ID_2|
//+------+-----------+---------+---------+---------------+---------------+--------+--------+
//|    11|  A06804173|        9|        9|     Elton John|     Elton John|12341000|13701921|
//+------+-----------+---------+---------+---------------+---------------+--------+--------+
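
With the duplicates side by side like this, the pairwise criteria from the question become plain column expressions. A minimal sketch for the two-duplicate case, assuming BLOCK and TYPE_DOC are also included in the map above (so that BLOCK_1, BLOCK_2, TYPE_DOC_1, and TYPE_DOC_2 exist after the pivot) and using placeholder rules (higher BLOCK wins, then lower TYPE_DOC):

import static org.apache.spark.sql.functions.*;

// Pick the winner between the two pivoted candidates and record
// which criterion decided it. The rules are placeholders.
Dataset<Row> decided = tuples.withColumn(
    "BEST_ROWNUMBER",
    when(col("BLOCK_1").gt(col("BLOCK_2")), lit(1))
        .when(col("BLOCK_2").gt(col("BLOCK_1")), lit(2))
        .when(col("TYPE_DOC_1").leq(col("TYPE_DOC_2")), lit(1))
        .otherwise(lit(2))
).withColumn(
    "CRITERIA",
    when(col("BLOCK_1").notEqual(col("BLOCK_2")), lit("BLOCK"))
        .otherwise(lit("TYPE_DOC"))
);

With more than two duplicates you would chain further comparisons or fold over the suffixed columns.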

Another way would be to group by your key columns and collect a list of structs, then, using the max size of the resulting arrays, access the elements and create multiple columns. Something like this:

import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.apache.spark.sql.Column;

Dataset<Row> tuples = duplicates.groupBy("ENTITY", "DOCUMENT_ID").agg(
    collect_list(expr("struct(COUNTRY, ID, CUSTOMER_NAME)")).alias("dupes")
);

int maxSize = tuples.select(max(size(col("dupes")))).first().getInt(0);

Column[] dupes = IntStream.rangeClosed(0, maxSize - 1)
        .mapToObj(i -> new Column[]{
                col("dupes").getItem(i).getField("COUNTRY").alias("COUNTRY_" + i),
                col("dupes").getItem(i).getField("ID").alias("ID_" + i),
                col("dupes").getItem(i).getField("CUSTOMER_NAME").alias("CUSTOMER_NAME_" + i),
        }).flatMap(Stream::of).toArray(Column[]::new);

tuples.select(
    Stream.of(new Column[]{col("ENTITY"), col("DOCUMENT_ID")}, dupes)
            .flatMap(Stream::of).toArray(Column[]::new)
).show();
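
If you only need the winning row per group rather than all duplicates side by side, a third option is to reduce the collected array directly with the aggregate higher-order function (Spark 2.4+). A sketch, again with a placeholder rule (higher BLOCK wins) standing in for the real criteria:

// Collect the deciding columns into structs, then fold the array pairwise:
// the accumulator always holds the best row seen so far.
Dataset<Row> best = duplicates.groupBy("ENTITY", "DOCUMENT_ID").agg(
    collect_list(expr("struct(BLOCK, TYPE_DOC, ID)")).alias("dupes")
).withColumn(
    "best",
    expr("aggregate(slice(dupes, 2, size(dupes)), dupes[0], " +
         "(acc, x) -> IF(x.BLOCK > acc.BLOCK, x, acc))")
);

To also report which criterion decided, you could carry an extra field in the accumulator struct.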

Upvotes: 1
