Reputation: 7967
I have the following list of columns, from which I want to make combinations taking two elements at a time:
numeric_cols = ['clump_thickness', 'a', 'b']
I am taking the combinations using the following function:
from itertools import combinations

def combinations2(x):
    return combinations(x, 2)
I am applying combinations2 to the RDD with map:
numeric_cols_sc = sc.parallelize(numeric_cols)
numeric_cols_sc.map(combinations2).flatMap(lambda x: x)
I was expecting an output of length 3:
[('clump_thickness', 'a'), ('clump_thickness', 'b'), ('a', 'b')]
But what I get is:
numeric_cols_sc.map(combinations2).flatMap(lambda x: x).take(3)
# [('c', 'l'), ('c', 'u'), ('c', 'm')]
Where am I going wrong?
Upvotes: 0
Views: 1247
Reputation: 901
I have written this algorithm, but with larger numbers it does not seem to work, or it is very slow. It will run on a big-data (Cloudera) cluster, so I think I need to move the function into PySpark. Please give me a hand if you can.
import pandas as pd
import itertools as itts

number_list = [10953, 10423, 10053]

def reducer(nums):
    def ranges(n):
        print(n)
        return range(n, -1, -1)
    num_list = list(map(ranges, nums))
    return list(itts.product(*num_list))

data = pd.DataFrame(reducer(number_list))
print(data)
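A quick size check (a sketch using only the standard library) suggests why this appears to hang with larger numbers: itertools.product over range(n, -1, -1) for each number materializes (n + 1) factors per number, so the list comprehension tries to build over a trillion tuples.

```python
import math

number_list = [10953, 10423, 10053]

# reducer() builds the full cross product of range(n, -1, -1) for each
# n, i.e. (n + 1) values per number, so the output size is their product:
total_tuples = math.prod(n + 1 for n in number_list)
print(total_tuples)  # roughly 1.15e12 tuples - far too many to hold in memory
```

No single machine (and no naive PySpark port) can materialize that list; the algorithm itself needs to avoid enumerating the full product.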
Upvotes: 0
Reputation: 45319
Your use of combinations2 behaves differently in Spark: map applies the function to each record of the RDD, and each record here is a single column name, so combinations iterates over the characters of that string. You should either make the whole list a single record:
numeric_cols_sc = sc.parallelize([numeric_cols])
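A plain-Python sketch (using itertools in place of the RDD, since map/flatMap apply the same callable per record) shows what each version computes:

```python
from itertools import chain, combinations

numeric_cols = ['clump_thickness', 'a', 'b']

def combinations2(x):
    return combinations(x, 2)

# sc.parallelize(numeric_cols) makes each string a record, so
# combinations2 pairs up the *characters* of each column name:
per_element = list(chain.from_iterable(map(combinations2, numeric_cols)))
print(per_element[:3])  # [('c', 'l'), ('c', 'u'), ('c', 'm')]

# sc.parallelize([numeric_cols]) makes the whole list one record, so
# combinations2 pairs up the column names themselves:
single_record = list(chain.from_iterable(map(combinations2, [numeric_cols])))
print(single_record)
# [('clump_thickness', 'a'), ('clump_thickness', 'b'), ('a', 'b')]
```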
Or use Spark's own operations, such as cartesian (the example below will require an additional transformation, since it produces ordered pairs including self-pairs):
numeric_cols_sc = sc.parallelize(numeric_cols)
numeric_cols_sc.cartesian(numeric_cols_sc)
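The additional transformation could be a filter that keeps each unordered pair once; in Spark that might look like `rdd.cartesian(rdd).filter(lambda p: p[0] < p[1])` (an assumption, not tested on a cluster). A plain-Python sketch of the same logic, with itertools.product standing in for cartesian:

```python
from itertools import product

numeric_cols = ['clump_thickness', 'a', 'b']

# cartesian/product yields all ordered pairs, including ('a', 'a') and
# both ('a', 'b') and ('b', 'a'); keeping only pairs where the first
# name sorts before the second leaves each unordered pair exactly once.
pairs = [p for p in product(numeric_cols, repeat=2) if p[0] < p[1]]
print(pairs)
```

Note the pair ordering differs from the combinations output (it follows string sort order rather than list order), but the same three column pairs come out.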
Upvotes: 2