Ajay
Ajay

Reputation: 198

Get distinct Combination of item pairs from Array[String] Column

I have the following dataframe

val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("VV","a",null,"",""))
.toDF("main","value1","value2","value3","value4")
input.show()

+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
|  ZZ|     a|     a|     b|     b|
|  ZZ|     a|     b|     c|     d|
|  YY|     b|     e|  null|     f|
|  YY|     b|     b|  null|     f|
|  VV|     a|  null|      |      |
+----+------+------+------+------+

I did the following to flatten out the data

val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
.groupBy('main).agg(collect_set('values).alias("values"))
.select('main, flatten($"values").alias("values"))
newdf.show()

+----+--------------------+
|main|              values|
+----+--------------------+
|  ZZ|[a, a, b, b, a, b...|
|  YY|[b, e,, f, b, b,, f]|
|  VV|            [a,, , ]|
+----+--------------------+

Now I need to pick every unique combination of items as a pair from the array and have them as separate rows of data. So for the dataframe given above, the result will be

+----+------+------+
|main|value1|value2|
+----+------+------+
|  ZZ|     a|     b|
|  ZZ|     a|     c|
|  ZZ|     a|     d|
|  ZZ|     b|     c|
|  ZZ|     d|     c|
|  YY|     b|     f|
|  YY|     b|     e|
|  YY|     e|     f|
|  VV|     a|      |
+----+------+------+

How do I reduce to column to unique items that I can split out as separate rows ?

Upvotes: 1

Views: 206

Answers (2)

Lamanus
Lamanus

Reputation: 13551

Use explode two times and filter.

val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
.groupBy('main).agg(flatten(collect_set('values)).alias("values"))
.withColumn("value1", explode(array_distinct('values)))
.withColumn("value2", explode(array_distinct('values)))
.filter("value1 < value2")
.select('main, 'value1, 'value2)
newdf.show()

+----+------+------+
|main|value1|value2|
+----+------+------+
|  ZZ|     a|     b|
|  ZZ|     a|     c|
|  ZZ|     a|     d|
|  ZZ|     b|     c|
|  ZZ|     b|     d|
|  ZZ|     c|     d|
|  YY|     b|     e|
|  YY|     b|     f|
|  YY|     e|     f|
|  VV|      |     a|
+----+------+------+

Upvotes: 2

Samir Vyas
Samir Vyas

Reputation: 451

I need to pick every unique combination of items as a pair, like if ZZ has values [a,b,c,d] then essentially you will create 6 pairs (4 choose 2)? In such case, you may want to create a UDAF (user defined aggregate function).

input
.select('main,array('value1,'value2,'value3,'value4).alias("values"))
.groupBy('main).agg(<here comes your UDAF>)

This UDAF should be such that it collects values as a set (or list then .distinct) and produces all combination pairs (can be done with 2 for loops).

After this point, your dataframe should look like

+----+------------------------------------+
|main|              values                |
+----+------------------------------------+
|  ZZ|[(a,b),(a,c),(a,d),(b,c),(b,d),(c,d)|
+----+------------------------------------+

Then you can .explode() to get a dataframe like

+----+-------+
|main|values |
+----+-------+
|  ZZ|(a,b)  |
+----+-------+
|  ZZ|(a,c)  |
+----+-------+
|  ZZ|(a,d)  |
+----+-------+
|  ZZ|(b,c)  |
+----+-------+
|  ZZ|(b,d)  |
+----+-------+
|  ZZ|(c,d)  |
+----+-------+

Then you can make column value1 as first value of this tuple and column value2 as second value.

Upvotes: 0

Related Questions