Reputation: 65
I have a PySpark dataframe:
+--------+------+
|numbers1|words1|
+--------+------+
| 1| word1|
| 1| word2|
| 1| word3|
| 2| word4|
| 2| word5|
| 3| word6|
| 3| word7|
| 3| word8|
| 3| word9|
+--------+------+
I want to produce another dataframe that contains all pairs of words within each group. So the result for the above would be:
ID wordA wordB
1 word1 word2
1 word1 word3
1 word2 word3
2 word4 word5
3 word6 word7
3 word6 word8
3 word6 word9
3 word7 word8
3 word7 word9
3 word8 word9
I know I can do this with pandas:
import pandas as pd
from itertools import combinations
ndf = df.groupby('numbers1')['words1'] \
    .apply(lambda x: list(combinations(x.values, 2))) \
    .apply(pd.Series).stack().reset_index(level=0, name='words')
But now I need to implement this with just the Spark APIs and without the itertools library. How can I rewrite this without combinations, using a DataFrame or RDD?
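To be clear about the transformation I want, here it is sketched in plain Python, with the groups taken from the example dataframe above:

```python
from itertools import combinations

# groups taken from the example dataframe above
groups = {
    1: ["word1", "word2", "word3"],
    2: ["word4", "word5"],
    3: ["word6", "word7", "word8", "word9"],
}

# one row per (id, wordA, wordB) pair within a group
rows = [(gid, a, b)
        for gid, words in groups.items()
        for a, b in combinations(words, 2)]
print(len(rows))  # 10 rows, matching the expected output
```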
Upvotes: 0
Views: 995
Reputation: 15258
Here is a solution using combinations inside a UDF. It uses the same logic as the pandas code you showed.
from itertools import combinations
from pyspark.sql import types as T, functions as F

# collect each group's words into a single array column
df_agg = df.groupBy("numbers1").agg(F.collect_list("words1").alias("words_list"))

@F.udf(
    T.ArrayType(
        T.StructType(
            [
                T.StructField("wordA", T.StringType(), True),
                T.StructField("wordB", T.StringType(), True),
            ]
        )
    )
)
def combi(words_list):
    return list(combinations(words_list, 2))

# build the pair array per group, then explode it into one row per pair
df_agg = df_agg.withColumn("combinations", combi(F.col("words_list")))
new_df = df_agg.withColumn("combination", F.explode("combinations")).select(
    "numbers1",
    F.col("combination.wordA").alias("wordA"),
    F.col("combination.wordB").alias("wordB"),
)
new_df.show()
new_df.show()
+--------+------+------+
|numbers1| wordA| wordB|
+--------+------+------+
| 1| word1| word2|
| 1| word1| word3|
| 1| word2| word3|
| 3| word6| word7|
| 3| word6| word8|
| 3| word6| word9|
| 3| word7| word8|
| 3| word7| word9|
| 3| word8| word9|
| 2| word4| word5|
+--------+------+------+
Upvotes: 0
Reputation: 13581
Here is my approach using a self-join on the dataframe.
# self-join on the group key, then keep each pair once via the ordering filter
df.join(df.withColumnRenamed('words1', 'words2'), ['numbers1'], 'outer') \
  .filter('words1 < words2') \
  .show(10, False)
+--------+------+------+
|numbers1|words1|words2|
+--------+------+------+
|1 |word1 |word3 |
|1 |word1 |word2 |
|1 |word2 |word3 |
|2 |word4 |word5 |
|3 |word6 |word9 |
|3 |word6 |word8 |
|3 |word6 |word7 |
|3 |word7 |word9 |
|3 |word7 |word8 |
|3 |word8 |word9 |
+--------+------+------+
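The `words1 < words2` filter keeps exactly one ordered copy of each pair. Assuming the words are distinct within a group, this matches what `itertools.combinations` yields on the sorted word list, which can be checked in plain Python without a Spark session:

```python
from itertools import combinations

group = ["word6", "word7", "word8", "word9"]

# self-join then filter: every ordered pair (a, b) with a < b
join_pairs = sorted((a, b) for a in group for b in group if a < b)

# itertools.combinations on the sorted list yields the same pairs
combo_pairs = sorted(combinations(sorted(group), 2))

assert join_pairs == combo_pairs
print(len(join_pairs))  # 6 pairs for a group of 4 words
```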
Upvotes: 1