martina.physics

Reputation: 9804

PySpark: multiplying columns from different tables

I have these two dataframes:

df1 = sc.parallelize([
['u1', 0.5],
['u2', 0.2],
['u3', 0.1],
['u4', 0.9],
['u5', 0.7]
]).toDF(('person', 'score'))

df2 = sc.parallelize([
['d1', 0.0],
['d2', 0.5],
['d3', 0.7],
]).toDF(('dog', 'score'))

What I need to do is create another dataframe whose schema would be

person, dog, score_person * score_dog

so basically multiplying the score column of both dataframes and keeping the first two columns. This multiplication has to take place for every possible pair of factors, i.e. each person with each dog, so my result dataframe would have 15 rows.
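
For concreteness, using the sample data above, the rows for u1 in the desired result would look like this (u1 paired with every dog):

person  dog  product
u1      d1   0.0    # 0.5 * 0.0
u1      d2   0.25   # 0.5 * 0.5
u1      d3   0.35   # 0.5 * 0.7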

I can't find a way to obtain this; it seems to me that it has to go through a SELECT on both dataframes, but neither JOIN nor UNION seems to help.

Upvotes: 3

Views: 7008

Answers (2)

Stein

Reputation: 441

This question is a few years old, but an explicit crossJoin method was added in version 2.1. Try:

df1.crossJoin(df2).select("person", "dog", (df1.score * df2.score).alias("product"))

More information here: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.crossJoin
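
For reference, a minimal, self-contained sketch of this approach (assuming Spark >= 2.1 and a SparkSession named spark; not part of the original snippet):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [('u1', 0.5), ('u2', 0.2), ('u3', 0.1), ('u4', 0.9), ('u5', 0.7)],
    ('person', 'score'))
df2 = spark.createDataFrame(
    [('d1', 0.0), ('d2', 0.5), ('d3', 0.7)],
    ('dog', 'score'))

# crossJoin produces every (person, dog) combination: 5 x 3 = 15 rows.
# df1.score / df2.score disambiguate the duplicated column name 'score'.
result = df1.crossJoin(df2).select(
    "person", "dog", (df1.score * df2.score).alias("product"))
result.show()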

Upvotes: 1

zero323

Reputation: 330123

Typically a Cartesian product is something you want to avoid, but a simple join without the on parameter is all you need here:

df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
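
One caveat (an assumption about newer versions, not covered in the answer itself): on Spark 2.x with default settings, an implicit Cartesian join like this may be rejected with an AnalysisException unless cross joins are enabled. If that happens, enabling the flag before running the join should help:

# Spark 2.x may refuse an implicit Cartesian join unless this flag is set
spark.conf.set("spark.sql.crossJoin.enabled", "true")

df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product")).show()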

Upvotes: 6
