martina.physics

Reputation: 9804

PySpark: multiplying columns from different tables

I have these two dataframes:

df1 = sc.parallelize([
['u1', 0.5],
['u2', 0.2],
['u3', 0.1],
['u4', 0.9],
['u5', 0.7]
]).toDF(('person', 'score'))

df2 = sc.parallelize([
['d1', 0.0],
['d2', 0.5],
['d3', 0.7],
]).toDF(('dog', 'score'))

What I need to do is create another dataframe whose schema would be

person, dog, score_person * score_dog

so basically multiplying the score column of both dataframes and keeping the first two columns. This multiplication has to take place for every possible pair of factors, i.e. each person with each dog, so my result dataframe would have 15 rows.
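
For concreteness, using the sample data above, the rows for u1 in the desired result would look like this (u1 paired with every dog):

person  dog  product
u1      d1   0.0    # 0.5 * 0.0
u1      d2   0.25   # 0.5 * 0.5
u1      d3   0.35   # 0.5 * 0.7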

I can't find a way to obtain this; it seems to me that it has to go through a SELECT on both dataframes, but neither JOIN nor UNION seems to help.

Upvotes: 3

Views: 7008

Answers (2)

Stein

Reputation: 441

This question is a few years old, but an explicit crossJoin method was added in version 2.1. Try:

df1.crossJoin(df2).select("person", "dog", (df1.score * df2.score).alias("product"))

More information here: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.crossJoin
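
For reference, a minimal, self-contained sketch of this approach (assuming Spark >= 2.1 and a SparkSession named spark; not part of the original snippet):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [('u1', 0.5), ('u2', 0.2), ('u3', 0.1), ('u4', 0.9), ('u5', 0.7)],
    ('person', 'score'))
df2 = spark.createDataFrame(
    [('d1', 0.0), ('d2', 0.5), ('d3', 0.7)],
    ('dog', 'score'))

# crossJoin produces every (person, dog) combination: 5 x 3 = 15 rows.
# df1.score / df2.score disambiguate the duplicated column name 'score'.
result = df1.crossJoin(df2).select(
    "person", "dog", (df1.score * df2.score).alias("product"))
result.show()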

Upvotes: 1

zero323

Reputation: 330123

Typically a Cartesian product is something you want to avoid, but a simple join without the on parameter is all you need here:

df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
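
One caveat (an assumption about newer versions, not covered in the answer itself): on Spark 2.x with default settings, an implicit Cartesian join like this may be rejected with an AnalysisException unless cross joins are enabled. If that happens, enabling the flag before running the join should help:

# Spark 2.x may refuse an implicit Cartesian join unless this flag is set
spark.conf.set("spark.sql.crossJoin.enabled", "true")

df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product")).show()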

Upvotes: 6
