Aaryan

Reputation: 71

Separate the dataset per rank in Spark

I have a dataset that includes employee names, dates, and balances, and each employee has its own rank numbering.

df.show();
+------------+----------+-------+----+
|    Employee|      date|balance|rank|
+------------+----------+-------+----+
|      A     |2016-02-05|   2143|   1|
|      A     |2016-07-05|    231|   2|
|      A     |2016-08-05|    447|   3|
|      A     |2017-10-05|    779|   4|
|      A     |2018-03-05|    255|   5|
|      A     |2018-05-05|    246|   6|
|      A     |2018-08-05|    378|   7|
|      A     |2018-11-05|  10635|   8|
|      A     |2019-06-05|     49|   9|
|      A     |2020-02-05|      0|  10|
|      A     |2020-04-05|    244|  11|
|      A     |2020-05-05|      0|  12|
|      A     |2020-09-05|    424|  13|
|      C     |2016-05-05|   1506|   1|
|      C     |2017-06-05|     52|   2|
|      C     |2017-09-05|    723|   3|
|      C     |2017-11-05|     23|   4|
+------------+----------+-------+----+  

I have to separate this dataset per rank, so that each employee's rows end up in their own table. My expected output is:

table1
+------------+----------+-------+----+
|    Employee|      date|balance|rank|
+------------+----------+-------+----+
|      A     |2016-02-05|   2143|   1|
|      A     |2016-07-05|    231|   2|
|      A     |2016-08-05|    447|   3|
|      A     |2017-10-05|    779|   4|
|      A     |2018-03-05|    255|   5|
|      A     |2018-05-05|    246|   6|
|      A     |2018-08-05|    378|   7|
|      A     |2018-11-05|  10635|   8|
|      A     |2019-06-05|     49|   9|
|      A     |2020-02-05|      0|  10|
|      A     |2020-04-05|    244|  11|
|      A     |2020-05-05|      0|  12|
|      A     |2020-09-05|    424|  13|
+------------+----------+-------+----+


table2

+------------+----------+-------+----+
|    Employee|      date|balance|rank|
+------------+----------+-------+----+
|      C     |2016-05-05|   1506|   1|
|      C     |2017-06-05|     52|   2|
|      C     |2017-09-05|    723|   3|
|      C     |2017-11-05|     23|   4|
+------------+----------+-------+----+

I used a window function to get this rank, but I don't understand how I can get separate tables like this. I am using Spark 2.0.0 and Java.

     WindowSpec ws = Window.partitionBy("Employee").orderBy("date");
     data = data.withColumn("rank", rank().over(ws));
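
For reference, a minimal self-contained sketch of that ranking step (assuming the DataFrame shown by df.show() above is named df, and the Spark 2.0.0 Java API):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.rank;

// Rank rows by date separately for each employee; the rank restarts at 1 per employee
WindowSpec ws = Window.partitionBy("Employee").orderBy("date");
Dataset<Row> ranked = df.withColumn("rank", rank().over(ws));
ranked.show();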

Upvotes: 2

Views: 260

Answers (1)

Lokesh Yadav

Reputation: 998

Here is sample code to achieve this, filtering on the distinct values of Employee:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.rank;

// Get the distinct Employee values
List<Row> distinctEmployees = df.select("Employee").distinct().collectAsList();

// Initialize an empty list for the new DataFrames
ArrayList<Dataset<Row>> newDFs = new ArrayList<>();

// Each filtered DataFrame contains a single employee, so no partitionBy is needed here
WindowSpec ws = Window.orderBy("date");

// Filter by each distinct Employee value and add the result to the list
for (Row distinctEmployee : distinctEmployees) {
    String employee = distinctEmployee.getString(0);

    newDFs.add(
            df.filter(col("Employee").equalTo(employee))
                    .withColumn("rank", rank().over(ws))
    );
}

// Show all the new DataFrames
for (Dataset<Row> aDF : newDFs) {
    aDF.show();
}
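
One design note: Window.orderBy("date") without a partitionBy moves all rows of the filtered DataFrame into a single partition to compute the rank, which is fine here because each per-employee DataFrame is small. If the rank column computed in the question is kept, it does not need to be recomputed after filtering. A sketch of that variant (assuming the already-ranked DataFrame is named df; the byEmployee map name is illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;

// Keep one filtered DataFrame per employee, reusing the existing rank column
Map<String, Dataset<Row>> byEmployee = new HashMap<>();
List<Row> employees = df.select("Employee").distinct().collectAsList();

for (Row r : employees) {
    String employee = r.getString(0);
    // rank was computed with partitionBy("Employee"), so it already restarts at 1 per employee
    byEmployee.put(employee, df.filter(col("Employee").equalTo(employee)));
}

for (Dataset<Row> part : byEmployee.values()) {
    part.show();
}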

Upvotes: 1
