Karthick
Karthick

Reputation: 11

Why is the count different for cross join in Pyspark w or w/o join condition?

dfj3 = spark.createDataFrame(
    ['a','b','b'],StringType()
    )

dfj4 = spark.createDataFrame(
    ['c','d','e'],StringType()
)

dfj3.join(dfj4).count() // #crossjoin, count = 9
dfj3.join(dfj4,dfj3.value==dfj4.value).count() #innerjoin, count = 0
dfj3.join(dfj4,dfj3.value==dfj4.value,'cross').count() #crossjoin with condition, count = 0

Why is 1st and 3rd cross join working differently?

Expected Cross join with join condition and cross join without join condition should be the same as the joins will be performed for all the records in both tables.

Upvotes: 1

Views: 61

Answers (1)

Vikas Sharma
Vikas Sharma

Reputation: 2141

All joins are a subset of the mathematical construct called cartesian product - cross join in SQL terms.

So, when you are applying the equality condition on the cross join - it's just behaving like an equi join - a class of inner join - and hence the result is different. This is also mentioned in the Join Wikipedia page:

CROSS JOIN does not itself apply any predicate to filter rows from the joined table. The results of a CROSS JOIN can be filtered using a WHERE clause, which may then produce the equivalent of an inner join.

Upvotes: 0

Related Questions