Martin Moore
Martin Moore

Reputation: 59

How to create all possible combinations of rows from a dataset

Suppose i have a dataset :

+----+----+----+-----+----+
||col0|col2|col2|col3|col4|
+----+----+----+-----+----+
|   t0|  10| 100|cat26|30.9|
|   t1|  20| 200|cat13|22.1|
|   t2|  30| 300|cat26|30.9|
|   t3|  40| 400|cat26|30.9|
|   t4|  50| 500|cat15|15.3|
|   t5|  60| 600|cat13|22.1|
+----+----+----+-----+----+

I select a sub-dataset using where() :

  Dataset<Row> subDf = dF.where("col3 = cat26 ");

Yielding :

     +----+----+----+-----+----+
     |col0|col2|col2|col3 |col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t2|  30| 300|cat26|30.9|
     | t3 |  40| 400|cat26|30.9|
     +----+----+----+-----+----+

I want to create different combinations using these three rows to form for each combination a dataset.

An example of a combination is as follow :

     +----+----+----+-----+----+
     |col0|col2|col2|col3 |col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t2|  30| 300|cat26|30.9|
     +----+----+----+-----+----+

Another one is :

 +----+----+----+-----+----+
 |col0|col2|col2|col3 |col4|
 +----+----+----+-----+----+
 |  t2|  30| 300|cat26|30.9|
 |  t3|  40| 400|cat26|30.9|
 +----+----+----+-----+----+

Third possible combination is as follows:

     +----+----+----+-----+----+
     |col0|col2|col2|col3 |col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t3|  40| 400|cat26|30.9|
     +----+----+----+-----+----+

And last :

     +----+----+----+-----+----+
     |col0|col2|col2|col3 |col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t2|  30| 300|cat26|30.9|
     | t3 |  40| 400|cat26|30.9|
     +----+----+----+-----+----+

PS : in the previous example each combination (a dataset) had at least two rows,

How to achieve just that in JAVA ? Thank you. .

Upvotes: 0

Views: 430

Answers (1)

Coursal
Coursal

Reputation: 1387

First things first, the last combination with the t0, t2, t3 rows is the same as the output in the simple query with the where method, so we need to focus on the other unique pairs we need to find:

  • {t0, t2}
  • {t0, t3}
  • {t2, t3}

We basically need to execute queries on the given DataFrame using the where method once again for every combination, because we can actually specify that we want to return a list with all the rows that match either one or more values from one column through SQL like this WHERE col0 == t0 OR col0 == t2. That means that we need to isolate and retrieve the list of the row values of col0 after the where method you used to filter out everything not matching cat26 in col3 (Java code snippet based on this answer by user12910640).

// Scala
val rowList = input.where("col3 == \"cat26\"")
          .select("col0")
          .rdd
          .map(row => row(0).toString)
          .collect()
          .toList

// Java
List<String> rowList = input.toJavaRDD()
          .where("col3 == \"cat26\"")
          .select("col0")
          .map(new Function<Row, String>() {
              public String call(Row row) {return row.getAs("column_name").toString();}
}).collect();

rowList has 3 elements in it: t0, t2, t3. That means we just need to iterate through this list in a way that we won't have duplicate pairs (e.g. {t0, t2} and {t2, t0}) and we won't have pairs with the same element (e.g. {t0, t0}). To get rid of those two cases, we simply use a second iteration loop for every first iteration that scans for every element after the current element of the first iteration (you can understand it better by looking at the code) and check that the indexes i and j of the iterations are not matching before we execute a query to the DataFrame.

// Scala
for(i <- 0 until rowList.size)
{
    for(j <- i until rowList.size)
    {
        if(!rowList(i).equals(rowList(j)))
            input.where("col0 == \"" + rowList(i) + "\" OR col0 == \"" + rowList(j) + "\"").show()
    }
}

// Java
for(int i = 0; i < rowList.size(); i++)
{
    for(int j = i; j < rowList.size(); j++)
    {
        if(!rowList(i).equals(rowList(j)))
            input.where("col0 == \"" + rowList.get(i) + "\" OR col0 == \"" + rowList.get(j) + "\"").show()
    }
}

Through this loop we can see the desired output DataFrames in the console.

+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|  t0|  10| 100|cat26|30.9|
|  t2|  30| 300|cat26|30.9|
+----+----+----+-----+----+

21/04/27 16:15:47 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.2.5:36343 in memory (size: 4.1 KiB, free: 292.8 MiB)
+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|  t0|  10| 100|cat26|30.9|
|  t3|  40| 400|cat26|30.9|
+----+----+----+-----+----+

+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|  t2|  30| 300|cat26|30.9|
|  t3|  40| 400|cat26|30.9|
+----+----+----+-----+----+

You can alternatively try to store those outputs in a DataFrame list along with the first one that matches the t0, t2, t3 rows, instead of just showing them to the console.

Upvotes: 1

Related Questions