Raghavendra Gupta

Reputation: 355

Clarification on usage of spark dataframe in scala

I have 3 DataFrames: 'u', 'join5' and 'site'.

Here is the schema of DataFrame 'u'.

scala> u.printSchema
root
 |-- split_sk: integer (nullable = true)
 |-- new_date: string (nullable = true)

Now I am creating join6 by joining the 'join5' and 'site' DataFrames:

val join6 = join5.join(site, u("split_sk") <=> site("split_key") && ($"new_date" >= $"effective_dt") && ($"new_date" <= $"expiry_dt"), "left")

Here are my 2 questions:

1. Is u("split_sk") referring to the split_sk column of 'u'? Is it valid to reference a column of a DataFrame that is not one of the two DataFrames being joined?
2. What does the <=> operator do here?

Upvotes: 0

Views: 57

Answers (1)

Sarath Chandra Vema

Reputation: 812

For question 1,

Yes, "split_sk" is the column in "u". This is similar to SQL, where you write a.column1 = b.column2; u("split_sk") <=> site("split_key") is the Spark way of specifying the same condition.

To answer the other part: yes, it is possible to reference a column of a DataFrame that is not one of the two DataFrames in the join. Most likely the join5 DataFrame was created on top of u, so Spark can resolve u("split_sk") through join5's lineage.
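A minimal sketch of that scenario (the intermediate DataFrame 'other' and its columns are hypothetical, and this assumes a running SparkSession; it is an illustration of the lineage idea, not your exact pipeline):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: because join5 is derived from u, the column
// u("split_sk") survives in join5's lineage and can still be referenced
// when joining join5 against site.
def buildJoin6(u: DataFrame, other: DataFrame, site: DataFrame): DataFrame = {
  // 'other' is a hypothetical lookup DataFrame that also has split_sk
  val join5 = u.join(other, Seq("split_sk"), "left") // carries u's columns forward

  join5.join(
    site,
    u("split_sk") <=> site("split_key") &&
      (join5("new_date") >= site("effective_dt")) &&
      (join5("new_date") <= site("expiry_dt")),
    "left"
  )
}
```

If join5 had instead been built from unrelated DataFrames, referencing u("split_sk") here would fail with an analysis error, because Spark could not trace that column back through join5's plan.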

For question 2,

<=> is the null-safe equality operator: it returns true when both sides are NULL and false when only one side is NULL, instead of returning NULL the way plain = does. Refer to this question: Spark SQL "<=>" operator
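The operator's truth table can be illustrated in plain Scala, using Option to stand in for a nullable column value (this sketch shows the semantics only; it is not Spark code):

```scala
// A plain-Scala sketch of null-safe equality (<=>) semantics.
// None plays the role of a NULL column value.
def nullSafeEq[A](left: Option[A], right: Option[A]): Boolean =
  (left, right) match {
    case (None, None)       => true   // NULL <=> NULL is true
    case (Some(a), Some(b)) => a == b // both non-null: ordinary equality
    case _                  => false  // exactly one side NULL: false, never NULL
  }
```

Ordinary SQL = would evaluate to NULL (unknown) whenever either side is NULL, which drops such rows from a join; <=> always yields a definite true or false, so rows where both keys are NULL still match.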

Upvotes: 1
