Reputation: 198
I have the following dataframe
val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("VV","a",null,"",""))
.toDF("main","value1","value2","value3","value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
| ZZ| a| a| b| b|
| ZZ| a| b| c| d|
| YY| b| e| null| f|
| YY| b| b| null| f|
| VV| a| null| | |
+----+------+------+------+------+
I did the following to flatten out the data
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
.groupBy('main).agg(collect_set('values).alias("values"))
.select('main, flatten($"values").alias("values"))
newdf.show()
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| VV| [a,, , ]|
+----+--------------------+
Now I need to pick every unique combination of items as a pair from the array and have them as separate rows of data. So for the dataframe given above, the result will be
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| ZZ| a| c|
| ZZ| a| d|
| ZZ| b| c|
| ZZ| d| c|
| YY| b| f|
| YY| b| e|
| YY| e| f|
| VV| a| |
+----+------+------+
How do I reduce to column to unique items that I can split out as separate rows ?
Upvotes: 1
Views: 206
Reputation: 13551
Use explode
two times and filter.
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
.groupBy('main).agg(flatten(collect_set('values)).alias("values"))
.withColumn("value1", explode(array_distinct('values)))
.withColumn("value2", explode(array_distinct('values)))
.filter("value1 < value2")
.select('main, 'value1, 'value2)
newdf.show()
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| ZZ| a| c|
| ZZ| a| d|
| ZZ| b| c|
| ZZ| b| d|
| ZZ| c| d|
| YY| b| e|
| YY| b| f|
| YY| e| f|
| VV| | a|
+----+------+------+
Upvotes: 2
Reputation: 451
I need to pick every unique combination of items as a pair
, like if ZZ has values [a,b,c,d] then essentially you will create 6 pairs (4 choose 2)? In such case, you may want to create a UDAF (user defined aggregate function).
input
.select('main,array('value1,'value2,'value3,'value4).alias("values"))
.groupBy('main).agg(<here comes your UDAF>)
This UDAF should be such that it collects values as a set (or list then .distinct
) and produces all combination pairs (can be done with 2 for loops).
After this point, your dataframe should look like
+----+------------------------------------+
|main| values |
+----+------------------------------------+
| ZZ|[(a,b),(a,c),(a,d),(b,c),(b,d),(c,d)|
+----+------------------------------------+
Then you can .explode()
to get a dataframe like
+----+-------+
|main|values |
+----+-------+
| ZZ|(a,b) |
+----+-------+
| ZZ|(a,c) |
+----+-------+
| ZZ|(a,d) |
+----+-------+
| ZZ|(b,c) |
+----+-------+
| ZZ|(b,d) |
+----+-------+
| ZZ|(c,d) |
+----+-------+
Then you can make column value1
as first value of this tuple and column value2
as second value.
Upvotes: 0