Reputation: 35
Consider the following PySpark DataFrame:
Col1 | Col2 | Col3 |
---|---|---|
A | D | G |
B | E | H |
C | F | I |
How can I create the following DataFrame, which adds one column for every pairwise combination of the existing columns?
Col1 | Col2 | Col3 | Col1_Col2_cross | Col1_Col3_cross | Col2_Col3_cross |
---|---|---|---|---|---|
A | D | G | A,D | A,G | D,G |
B | E | H | B,E | B,H | E,H |
C | F | I | C,F | C,I | F,I |
Upvotes: 0
Views: 935
Reputation: 42352
You can generate the column combinations using itertools.combinations:
import itertools

import pyspark.sql.functions as F

# Keep every original column, then add one comma-joined column per pair of columns.
df2 = df.select(
    '*',
    *[F.concat_ws(',', a, b).alias(a + '_' + b + '_cross')
      for a, b in itertools.combinations(df.columns, 2)]
)
df2.show()
+----+----+----+---------------+---------------+---------------+
|Col1|Col2|Col3|Col1_Col2_cross|Col1_Col3_cross|Col2_Col3_cross|
+----+----+----+---------------+---------------+---------------+
|   A|   D|   G|            A,D|            A,G|            D,G|
|   B|   E|   H|            B,E|            B,H|            E,H|
|   C|   F|   I|            C,F|            C,I|            F,I|
+----+----+----+---------------+---------------+---------------+
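To see what the comprehension above iterates over, here is a minimal sketch that runs without Spark. The list named columns is a hypothetical stand-in for df.columns from the question, and the names pairs and aliases are introduced only for illustration.

```python
import itertools

# Hypothetical stand-in for df.columns in the question's DataFrame
columns = ["Col1", "Col2", "Col3"]

# combinations(..., 2) yields each unordered pair once, in the input order
pairs = list(itertools.combinations(columns, 2))

# Names the new columns would receive, following the answer's naming pattern
aliases = [f"{a}_{b}_cross" for a, b in pairs]

print(pairs)
print(aliases)
```

Because each pair is produced only once, n columns give n * (n - 1) / 2 new columns, so the result can get wide quickly on DataFrames with many columns.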
Upvotes: 0