Reputation: 9804
I have these two dataframes:
df1 = sc.parallelize([
    ['u1', 0.5],
    ['u2', 0.2],
    ['u3', 0.1],
    ['u4', 0.9],
    ['u5', 0.7]
]).toDF(('person', 'score'))
df2 = sc.parallelize([
    ['d1', 0.0],
    ['d2', 0.5],
    ['d3', 0.7],
]).toDF(('dog', 'score'))
What I need to do is create another dataframe whose schema would be
person, dog, score_person * score_dog
so basically multiplying the score column of both dataframes while keeping the first two columns. The multiplication has to take place for every possible pair of factors, i.e. each person with each dog, so that my result dataframe would have 15 rows.
I can't find a way to obtain this; it seems to me that it has to go through a SELECT on both dataframes, but neither a JOIN nor a UNION can help.
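For example, pairing u1 with every dog should yield the rows (u1, d1, 0.0), (u1, d2, 0.25) and (u1, d3, 0.35).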
Upvotes: 3
Views: 7008
Reputation: 441
Looks like this question is a few years old, but an explicit cross-join method, crossJoin, was added in version 2.1. Try:
df1.crossJoin(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
Found the information here: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.crossJoin
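If you want to run this as a self-contained script rather than in a spark shell, here is a minimal sketch; the SparkSession setup and the app name "cross-join-demo" are my own additions, not part of the original snippets:

from pyspark.sql import SparkSession

# Build a session; the app name is a placeholder.
spark = SparkSession.builder.appName("cross-join-demo").getOrCreate()

df1 = spark.createDataFrame(
    [('u1', 0.5), ('u2', 0.2), ('u3', 0.1), ('u4', 0.9), ('u5', 0.7)],
    ('person', 'score'))
df2 = spark.createDataFrame(
    [('d1', 0.0), ('d2', 0.5), ('d3', 0.7)],
    ('dog', 'score'))

# Explicit Cartesian product, available since Spark 2.1.
result = df1.crossJoin(df2).select(
    'person', 'dog', (df1.score * df2.score).alias('product'))
result.show()  # 5 persons x 3 dogs = 15 rows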
Upvotes: 1
Reputation: 330123
Typically a Cartesian product is something you want to avoid, but a simple join without the on parameter is all you need here:
df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
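Note that on Spark 2.x an implicit Cartesian product like this may be rejected with an AnalysisException unless cross joins are enabled. If you hit that, the setting below (a sketch, assuming a SparkSession named spark) lifts the restriction:

spark.conf.set("spark.sql.crossJoin.enabled", "true")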
Upvotes: 6