How to generate datasets based on combinations of indexes?

Question

I want to compute the datasets that correspond to the index combinations of a list of integers. For example, given the list of integers [0,1,2,3] and the following initial dataset:

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000| 
|  1|     B        |     1000|
|  2|     C        |     2000|
|  3|     D        |     3000|
+---+--------------+---------+

and a limit of 2, the resulting index combinations are:

[0, 1]
[0, 2]
[0, 3]
[1, 2]
[1, 3]
[2, 3]

I would like to get the following corresponding datasets:

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000|
|  1|     B        |     1000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000|
|  2|     C        |     2000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  0|     A        |     1000|
|  3|     D        |     3000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  1|     B        |     1000|
|  2|     C        |     2000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  1|     B        |     1000|
|  3|     D        |     3000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations|   Qte   |
+---+--------------+---------+
|  2|     C        |     2000|
|  3|     D        |     3000|
+---+--------------+---------+

For now I am doing this on a single node, using the classic way of generating combinations in Java with the following code:

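import java.util.ArrayList;
import java.util.List;

// Recursively fills data[index..] with increasing values taken from [start, end].
private void helper(List<int[]> combinations, int[] data, int start, int end, int index) {
    if (index == data.length) {
        // All r slots are filled: store a copy of the current combination.
        int[] combination = data.clone();
        combinations.add(combination);
    } else if (start <= end) {
        data[index] = start;
        helper(combinations, data, start + 1, end, index + 1); // keep start at this slot
        helper(combinations, data, start + 1, end, index);     // skip start for this slot
    }
}

// Returns all combinations of r indexes chosen from 0..n-1.
public List<int[]> generate(int n, int r) {
    List<int[]> combinations = new ArrayList<>();
    helper(combinations, new int[r], 0, n - 1, 0);
    return combinations;
}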


List<int[]> combinations = generate(numberOfRows, k);

// Collect the rows to the driver once, not once per combination.
List<Row> rows = initialDataset.collectAsList();

// datasetOfRows is assumed to be a List<Dataset<Row>> declared elsewhere.
for (int[] combination : combinations) {
    List<Row> datasetRows = new ArrayList<>();
    for (int index : combination) {
        datasetRows.add(rows.get(index));
    }
    Dataset<Row> datasetOfSRows = sparksession.createDataFrame(datasetRows, schema);
    datasetOfRows.add(datasetOfSRows);
}

However, I want a native Spark solution that uses many nodes to compute the resulting datasets (e.g. through map()). How can I achieve that using Java or Scala?

ksoulllpwk · Accepted Answer

You might want to read about isin in Spark SQL. This page explains how to use it: https://sparkbyexamples.com/spark/spark-isin-is-not-in-operator-example/

After looking at your code, I came up with the code below. I hope it helps.

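// combination is an int[], so box it before passing it to isin(Object...).
// The filter is applied to the initial dataset, once per combination.
initialDataset.filter(col("id").isin(Arrays.stream(combination).boxed().toArray()));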
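
To make the idea more concrete, here is a minimal plain-Java sketch (no Spark involved) of what an isin-style filter selects for each combination. The names IsinSketch, ShopRow and filterByIds are hypothetical and exist only for this illustration. In Spark, the same selection would be expressed through the filter on the id column rather than a loop over a local list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class IsinSketch {

    // Hypothetical stand-in for one row of the example dataset.
    public static final class ShopRow {
        public final int id;
        public final String location;
        public final int qte;

        public ShopRow(int id, String location, int qte) {
            this.id = id;
            this.location = location;
            this.qte = qte;
        }

        @Override
        public String toString() {
            return "(" + id + ", " + location + ", " + qte + ")";
        }
    }

    // Keeps only the rows whose id appears in the combination,
    // which is the selection an isin filter on the id column performs.
    public static List<ShopRow> filterByIds(List<ShopRow> rows, int[] combination) {
        Set<Integer> ids = Arrays.stream(combination).boxed().collect(Collectors.toSet());
        List<ShopRow> kept = new ArrayList<>();
        for (ShopRow row : rows) {
            if (ids.contains(row.id)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ShopRow> rows = Arrays.asList(
            new ShopRow(0, "A", 1000),
            new ShopRow(1, "B", 1000),
            new ShopRow(2, "C", 2000),
            new ShopRow(3, "D", 3000));

        // The six index combinations from the question (limit = 2).
        int[][] combinations = {{0, 1}, {0, 2}, {0, 3}, {1, 2}, {1, 3}, {2, 3}};

        for (int[] combination : combinations) {
            System.out.println(Arrays.toString(combination) + " -> " + filterByIds(rows, combination));
        }
    }
}
```

Note that this matches on the value of the id column, not on the row position, so it assumes the ids are the same as the indexes used to build the combinations, as in the example dataset.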
