Clock Slave

Reputation: 7967

Using combinations in PySpark

I have the following list of column names, and I want to generate all combinations of two elements at a time:

numeric_cols = ['clump_thickness', 'a', 'b']

I am generating the combinations with the following function:

from itertools import combinations
def combinations2(x):
    return combinations(x,2)

I am applying combinations2 with map:

numeric_cols_sc = sc.parallelize(numeric_cols)
numeric_cols_sc.map(combinations2).flatMap(lambda x: x)

I was expecting an output of length 3:

[('clump_thickness', 'a'), ('clump_thickness', 'b'), ('a','b')]

But what I get is:

numeric_cols_sc.map(combinations2).flatMap(lambda x: x).take(3)
# [('c', 'l'), ('c', 'u'), ('c', 'm')]

Where am I going wrong?

Upvotes: 0

Views: 1247

Answers (2)

Andy Quiroz

Reputation: 901

I have made this algorithm, but with higher numbers it looks like it doesn't work or is very slow. It will run on a big data cluster (Cloudera), so I think I have to move the function into PySpark. Please give a hand if you can.

import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    # build a descending range n, n-1, ..., 0 for each input number
    def ranges(n):
        print(n)
        return range(n, -1, -1)

    num_list = list(map(ranges, nums))
    # materialize the full cartesian product of all the ranges; with
    # these inputs that is about 10954 * 10424 * 10054 tuples, which
    # is why it becomes so slow for larger numbers
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
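If the full cross product really is what's needed, one way to push the work into PySpark (a rough sketch under that assumption; sc is assumed to be an existing SparkContext) is to parallelize the outermost range and expand the rest inside each task:

import itertools as itts

number_list = [10953, 10423, 10053]

def tail_product(i):
    # descending ranges for all but the first number
    ranges = [range(n, -1, -1) for n in number_list[1:]]
    # prefix each tail tuple with the outer value i
    return ((i,) + rest for rest in itts.product(*ranges))

# distribute the outer loop; each task expands one slice of the product
outer = sc.parallelize(range(number_list[0], -1, -1))
products = outer.flatMap(tail_product)
print(products.take(5))

Even distributed, the full product here has roughly 10^12 tuples, so it can be streamed or sampled with take, but not collected into a pandas DataFrame.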

Upvotes: 0

ernest_k

Reputation: 45319

combinations2 doesn't receive what you expect when you run it through Spark. sc.parallelize(numeric_cols) creates an RDD with three records, one per column-name string, so map applies combinations2 to each string individually; combinations then iterates over the characters of that string, which is why you see pairs like ('c', 'l').

You should either make that list a single record:

numeric_cols_sc = sc.parallelize([numeric_cols])
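
With that change, the rest of your original pipeline works unchanged (a minimal sketch; sc is assumed to be an existing SparkContext):

from itertools import combinations

def combinations2(x):
    return combinations(x, 2)

numeric_cols = ['clump_thickness', 'a', 'b']

# the whole list is now a single record, so combinations2 sees the
# full list instead of one column-name string at a time
numeric_cols_sc = sc.parallelize([numeric_cols])
print(numeric_cols_sc.map(combinations2).flatMap(lambda x: x).collect())
# [('clump_thickness', 'a'), ('clump_thickness', 'b'), ('a', 'b')]

Equivalently, numeric_cols_sc.flatMap(combinations2) collapses the map/flatMap pair into a single step.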

Or use Spark's own operations, such as cartesian (the example below will require an additional transformation to match your expected output):

numeric_cols_sc = sc.parallelize(numeric_cols)
numeric_cols_sc.cartesian(numeric_cols_sc)
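
That additional transformation could look like this (a sketch, assuming the column names are distinct; cartesian yields all 9 ordered pairs, including self-pairs):

pairs = (numeric_cols_sc
         .cartesian(numeric_cols_sc)
         # drop self-pairs and keep only one orientation of each pair
         .filter(lambda pair: pair[0] < pair[1]))
print(pairs.collect())
# [('a', 'b'), ('a', 'clump_thickness'), ('b', 'clump_thickness')]

Note that the elements within each pair come out in lexicographic order rather than list order, but these are the same three combinations.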

Upvotes: 2
