Reputation: 95
I want to compute the datasets that correspond to the index combinations of a list of integers. As an example, if I have the following list of integers [0,1,2,3]
and the initial dataset:
+---+--------------+---------+
| id|Shop Locations| Qte |
+---+--------------+---------+
| 0| A | 1000|
| 1| B | 1000|
| 2| C | 2000|
| 3| D | 3000|
+---+--------------+---------+
and a limit number = 2,
then, given that the resulting index combinations are:
[0, 1]
[0, 2]
[0, 3]
[1, 2]
[1, 3]
[2, 3]
I would want the following corresponding datasets:
+---+--------------+---------+
| id|Shop Locations| Qte |
+---+--------------+---------+
| 0| A | 1000|
| 1| B | 1000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations| Qte |
+---+--------------+---------+
| 0| A | 1000|
| 2| C | 2000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations| Qte |
+---+--------------+---------+
| 0| A | 1000|
| 3| D | 3000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations| Qte |
+---+--------------+---------+
| 1| B | 1000|
| 2| C | 2000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations| Qte |
+---+--------------+---------+
| 1| B | 1000|
| 3| D | 3000|
+---+--------------+---------+

+---+--------------+---------+
| id|Shop Locations| Qte |
+---+--------------+---------+
| 2| C | 2000|
| 3| D | 3000|
+---+--------------+---------+
For now I am doing it on a single node, using the classic way of generating combinations in Java, with the following code:
private void helper(List<int[]> combinations, int data[], int start, int end, int index) {
    if (index == data.length) {
        int[] combination = data.clone();
        combinations.add(combination);
    } else if (start <= end) {
        data[index] = start;
        helper(combinations, data, start + 1, end, index + 1);
        helper(combinations, data, start + 1, end, index);
    }
}

public List<int[]> generate(int n, int r) {
    List<int[]> combinations = new ArrayList<>();
    helper(combinations, new int[r], 0, n-1, 0);
    return combinations;
}
// Collect the rows to the driver once, rather than once per combination.
List<Row> rows = initialDataset.collectAsList();
List<int[]> combinations = generate(numberOfRows, k);
for (int[] combination : combinations) {
    List<Row> datasetRows = new ArrayList<>();
    for (int index : combination) {
        datasetRows.add(rows.get(index));
    }
    // Build one small dataset from the rows picked by this combination.
    Dataset<Row> datasetOfSRows = sparksession.createDataFrame(datasetRows, schema);
    datasetOfRows.add(datasetOfSRows);
}
But I want a native Spark solution to this problem, one that uses many nodes to compute the resulting datasets (e.g. through map()).
How can I achieve that using Java / Scala?
Upvotes: 1
Views: 149
Reputation: 154
You might need to read about isin for Spark SQL. This link explains how to use it: https://sparkbyexamples.com/spark/spark-isin-is-not-in-operator-example/.
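After looking at your code, I came up with the snippet below. I hope it helps.
// Filter the initial dataset by id; the int[] combination is boxed because isin takes Object varargs.
initialDataset.filter(col("id").isin(Arrays.stream(combination).boxed().toArray()));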
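One detail worth spelling out for the Java side: Column.isin accepts a varargs list of objects, so a primitive int[] combination would most likely need to be boxed before it is passed in, otherwise the whole array is treated as a single value. The plain-Java sketch below only shows that boxing step, with no Spark dependency; the class name IsinArgs and the helper toIsinArgs are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

class IsinArgs {
    // Hypothetical helper: boxes a primitive int[] combination into an Object[]
    // so that each index becomes its own element of a varargs call.
    public static Object[] toIsinArgs(int[] combination) {
        return Arrays.stream(combination).boxed().toArray();
    }

    public static void main(String[] args) {
        int[] combination = {0, 1};
        Object[] boxed = toIsinArgs(combination);
        System.out.println(Arrays.toString(boxed)); // prints [0, 1]
    }
}
```

The resulting array could then be handed to isin inside the filter, once per combination, assuming the id column holds the same integer indexes as in the question.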
Upvotes: 2