Clock Slave

Reputation: 7967

Using combinations in PySpark

I have the following list of column names, and I want to generate all combinations of two elements at a time:

numeric_cols = ['clump_thickness', 'a', 'b']

I am generating the combinations with the following function:

from itertools import combinations
def combinations2(x):
    return combinations(x,2)

I am applying combinations2 with map:

numeric_cols_sc = sc.parallelize(numeric_cols)
numeric_cols_sc.map(combinations2).flatMap(lambda x: x)

I was expecting an output of length 3:

[('clump_thickness', 'a'), ('clump_thickness', 'b'), ('a','b')]

But what I get is:

numeric_cols_sc.map(combinations2).flatMap(lambda x: x).take(3)
# [('c', 'l'), ('c', 'u'), ('c', 'm')]

Where am I going wrong?

Upvotes: 0

Views: 1247

Answers (2)

Andy Quiroz

Reputation: 901

I have made this algorithm, but with higher numbers it looks like it doesn't work or is very slow. It will run on a big data cluster (Cloudera), so I think I have to move the function into PySpark. Please give a hand if you can.

import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    # build a descending range n, n-1, ..., 0 for each input number
    def ranges(n):
        print(n)
        return range(n, -1, -1)

    num_list = list(map(ranges, nums))
    # materialize the full cartesian product of all the ranges; with
    # these inputs that is about 10954 * 10424 * 10054 tuples, which
    # is why it becomes so slow for larger numbers
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
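If the full cross product really is what's needed, one way to push the work into PySpark (a rough sketch under that assumption; sc is assumed to be an existing SparkContext) is to parallelize the outermost range and expand the rest inside each task:

import itertools as itts

number_list = [10953, 10423, 10053]

def tail_product(i):
    # descending ranges for all but the first number
    ranges = [range(n, -1, -1) for n in number_list[1:]]
    # prefix each tail tuple with the outer value i
    return ((i,) + rest for rest in itts.product(*ranges))

# distribute the outer loop; each task expands one slice of the product
outer = sc.parallelize(range(number_list[0], -1, -1))
products = outer.flatMap(tail_product)
print(products.take(5))

Even distributed, the full product here has roughly 10^12 tuples, so it can be streamed or sampled with take, but not collected into a pandas DataFrame.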

Upvotes: 0

ernest_k

Reputation: 45319

combinations2 doesn't receive what you expect when you run it through Spark. sc.parallelize(numeric_cols) creates an RDD with three records, one per column-name string, so map applies combinations2 to each string individually; combinations then iterates over the characters of that string, which is why you see pairs like ('c', 'l').

You should either make that list a single record:

numeric_cols_sc = sc.parallelize([numeric_cols])
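
With that change, the rest of your original pipeline works unchanged (a minimal sketch; sc is assumed to be an existing SparkContext):

from itertools import combinations

def combinations2(x):
    return combinations(x, 2)

numeric_cols = ['clump_thickness', 'a', 'b']

# the whole list is now a single record, so combinations2 sees the
# full list instead of one column-name string at a time
numeric_cols_sc = sc.parallelize([numeric_cols])
print(numeric_cols_sc.map(combinations2).flatMap(lambda x: x).collect())
# [('clump_thickness', 'a'), ('clump_thickness', 'b'), ('a', 'b')]

Equivalently, numeric_cols_sc.flatMap(combinations2) collapses the map/flatMap pair into a single step.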

Or use Spark's own operations, such as cartesian (the example below will require an additional transformation to match your expected output):

numeric_cols_sc = sc.parallelize(numeric_cols)
numeric_cols_sc.cartesian(numeric_cols_sc)
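
That additional transformation could look like this (a sketch, assuming the column names are distinct; cartesian yields all 9 ordered pairs, including self-pairs):

pairs = (numeric_cols_sc
         .cartesian(numeric_cols_sc)
         # drop self-pairs and keep only one orientation of each pair
         .filter(lambda pair: pair[0] < pair[1]))
print(pairs.collect())
# [('a', 'b'), ('a', 'clump_thickness'), ('b', 'clump_thickness')]

Note that the elements within each pair come out in lexicographic order rather than list order, but these are the same three combinations.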

Upvotes: 2
