user1877600

Reputation: 647

pyspark generate all combinations of unique values

I am trying to generate all combinations of unique values within my Spark dataframe. The solution that comes to my mind requires using itertools.product and a pandas dataframe, and therefore it is not efficient enough. Here is my code:

import itertools
import pandas as pd

# collect the distinct values of each column to the driver
all_date = [i.Date for i in df.select("Date").distinct().collect()]
all_stores_id = [i.ID for i in df.select("ID").distinct().collect()]
all_category = [i.CATEGORY for i in df.select("CATEGORY").distinct().collect()]
combined = [all_date, all_stores_id, all_category]

# build every combination in pandas, then convert back to Spark
all_combination_pdf = pd.DataFrame(columns=['Date', 'ID', 'CATEGORY'],
                                   data=list(itertools.product(*combined)))
all_combination_df = sqlContext.createDataFrame(all_combination_pdf)
joined = all_combination_df.join(df, ["Date", "ID", "CATEGORY"], how="left")

Is there any way to change this code into a more Spark-idiomatic one?

======EDIT======

I've also tried to implement this functionality using the crossJoin function. Here is the code:

test_df = ((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())
test_df.show(10)

which, for some unknown reason, raises the following exception:

An error occurred while calling o305.showString.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Integer.valueOf(Integer.java:832)

Upvotes: 3

Views: 6413

Answers (2)

Kay

Reputation: 1

You can use the readily available cube function to get all possible combinations of PySpark column values. A great answer on this topic is also cited in this thread.
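For reference, here is a minimal sketch of what the cube approach looks like (the tiny example dataframe below is hypothetical, standing in for the question's df). Note that cube builds aggregation grouping sets, with nulls for rolled-up columns, over combinations that actually occur in the data, so it is not quite a cartesian product of all distinct values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the question's df.
df = spark.createDataFrame(
    [("2019-01-01", 1, "A"), ("2019-01-02", 2, "B")],
    ["Date", "ID", "CATEGORY"],
)

# cube() produces one row per grouping set; null means the column was
# rolled up. Only combinations present in the data appear.
df.cube("Date", "ID", "CATEGORY").count().show()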

Upvotes: 0

Manrique

Reputation: 2221

You can generate the dataframe with this. It just creates a dataframe with the unique values of each column and performs a cross join (cartesian product) with the others.

((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())

It can be put inside a for loop with some work to automate it for other dataframes, as shown in the sketch below.
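One possible way to automate it (a sketch, assuming the column names are known up front; the example data below is hypothetical) is to fold the per-column distinct frames with reduce:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example frame; replace with your own df.
df = spark.createDataFrame(
    [("2019-01-01", 1, "A"), ("2019-01-02", 2, "B")],
    ["Date", "ID", "CATEGORY"],
)

cols = ["Date", "ID", "CATEGORY"]

# Cross-join the distinct values of every column, one column at a time.
all_combinations = reduce(
    lambda left, right: left.crossJoin(right),
    [df.select(c).distinct() for c in cols],
)

# Left-join back onto the original data, as in the question.
joined = all_combinations.join(df, cols, how="left")
joined.show(10)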

Hope this helps

Upvotes: 3
