Reputation: 647
I am trying to generate all combinations of the unique values within my Spark DataFrame. The solution that comes to mind requires itertools.product and a pandas DataFrame, and is therefore not efficient enough. Here is my code:
import itertools
import pandas as pd

all_date = [i.Date for i in df.select("Date").distinct().collect()]
all_stores_id = [i.ID for i in df.select("ID").distinct().collect()]
all_category = [i.CATEGORY for i in df.select("CATEGORY").distinct().collect()]
combined = [all_date, all_stores_id, all_category]
all_combination_pdf = pd.DataFrame(columns=['Date', 'ID', 'CATEGORY'], data=list(itertools.product(*combined)))
# convert the pandas dataframe to a spark dataframe
all_combination_df = sqlContext.createDataFrame(all_combination_pdf)
joined = all_combination_df.join(df, ["Date", "ID", "CATEGORY"], how="left")
Is there any way to change this code into a more Spark-like one?
======EDIT======
I've also tried to implement this functionality using the crossJoin function. Here is the code:
test_df = ((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())
test_df.show(10)
which, for some unknown reason, raises the following exception:
An error occurred while calling o305.showString.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:832)
Upvotes: 3
Views: 6413
Reputation: 1
You can use the readily available cube function to get all the possible combinations of PySpark column values. I am also citing a great answer on this topic in this thread.
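A minimal sketch of how cube might be applied here, assuming the same df as in the question; note that cube produces grouping-set aggregates over the observed rows, with null marking the "all values" level of each column:

# cube() builds grouping sets over the rows that exist in df;
# a null in a column marks the "all values" level for that grouping.
df.cube("Date", "ID", "CATEGORY").count().show()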
Upvotes: 0
Reputation: 2221
You can generate the dataframe with this. It simply builds a dataframe of the distinct values of each column and cross joins (Cartesian product) them together.
((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())
With a little work it can be put inside a loop to automate it for other dataframes, as sketched below.
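A minimal sketch of that loop, assuming a dataframe df and a hypothetical helper name all_combinations; it folds the per-column distinct values together with successive crossJoins:

from functools import reduce

# Hypothetical helper: Cartesian product of the distinct values
# of the given columns, built via successive crossJoins.
def all_combinations(df, columns):
    distinct_cols = [df.select(c).distinct() for c in columns]
    return reduce(lambda left, right: left.crossJoin(right), distinct_cols)

combos = all_combinations(df, ["Date", "ID", "CATEGORY"])
combos.show(10)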
Hope this helps
Upvotes: 3