Raghavendra Gupta

Reputation: 355

Clarification on usage of spark dataframe in scala

I have 3 DataFrames: 'u', 'join5' and 'site'.

Here is the schema of DataFrame 'u'.

scala> u.printSchema
root
 |-- split_sk: integer (nullable = true)
 |-- new_date: string (nullable = true)

Now I am creating join6 by joining the 'join5' and 'site' DataFrames:

val join6 = join5.join(site, u("split_sk") <=> site("split_key") && ($"new_date" >= $"effective_dt") && ($"new_date" <= $"expiry_dt"), "left")

Here are my 2 questions:

1. Is u("split_sk") referring to the split_sk column of 'u'? Is it valid to reference a column of a DataFrame that is not one of the two DataFrames being joined?
2. What does the <=> operator do here?

Upvotes: 0

Views: 57

Answers (1)

Sarath Chandra Vema

Reputation: 812

For question 1,

Yes, "split_sk" is the column in "u". This is similar to SQL, where you write a.column1 = b.column2; u("split_sk") <=> site("split_key") is the Spark way of specifying the same condition.

To answer the other part: yes, it is possible to reference a column of a DataFrame that is not one of the two DataFrames in the join. Most likely the join5 DataFrame was created on top of u, so Spark can resolve u("split_sk") through join5's lineage.
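A minimal sketch of that scenario (the intermediate DataFrame 'other' and its columns are hypothetical, and this assumes a running SparkSession; it is an illustration of the lineage idea, not your exact pipeline):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: because join5 is derived from u, the column
// u("split_sk") survives in join5's lineage and can still be referenced
// when joining join5 against site.
def buildJoin6(u: DataFrame, other: DataFrame, site: DataFrame): DataFrame = {
  // 'other' is a hypothetical lookup DataFrame that also has split_sk
  val join5 = u.join(other, Seq("split_sk"), "left") // carries u's columns forward

  join5.join(
    site,
    u("split_sk") <=> site("split_key") &&
      (join5("new_date") >= site("effective_dt")) &&
      (join5("new_date") <= site("expiry_dt")),
    "left"
  )
}
```

If join5 had instead been built from unrelated DataFrames, referencing u("split_sk") here would fail with an analysis error, because Spark could not trace that column back through join5's plan.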

For question 2,

<=> is the null-safe equality operator: it returns true when both sides are NULL and false when only one side is NULL, instead of returning NULL the way plain = does. Refer to this question: Spark SQL "<=>" operator
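The operator's truth table can be illustrated in plain Scala, using Option to stand in for a nullable column value (this sketch shows the semantics only; it is not Spark code):

```scala
// A plain-Scala sketch of null-safe equality (<=>) semantics.
// None plays the role of a NULL column value.
def nullSafeEq[A](left: Option[A], right: Option[A]): Boolean =
  (left, right) match {
    case (None, None)       => true   // NULL <=> NULL is true
    case (Some(a), Some(b)) => a == b // both non-null: ordinary equality
    case _                  => false  // exactly one side NULL: false, never NULL
  }
```

Ordinary SQL = would evaluate to NULL (unknown) whenever either side is NULL, which drops such rows from a join; <=> always yields a definite true or false, so rows where both keys are NULL still match.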

Upvotes: 1
