user2044234

Reputation: 23

Perform a Correlated Scalar SubQuery in Spark Dataframe Java API (spark v2.3.0)

I have read that in Spark you can easily do a correlated scalar subquery like so:

select
    column1,
    (select column2 from table2 where table2.some_key = table1.id)
from table1
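
For context, that SQL does run from Java once the inputs are registered as temporary views. A minimal sketch (assuming a SparkSession named spark and Datasets table1 and table2 like the ones below):

table1.createOrReplaceTempView("table1");
table2.createOrReplaceTempView("table2");

Dataset<Row> result = spark.sql(
    "select column1, " +
    "       (select column2 from table2 where table2.some_key = table1.id) " +
    "from table1");
result.show();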

What I have not figured out is how to do this in the DataFrame API. The best I can come up with is a join. The problem is that in my specific case I am joining against an enum-like lookup table that actually applies to more than one column.

Below is an example of the DataFrame code.

import static org.apache.spark.sql.functions.col;

Dataset<Row> table1 = getTable1FromSomewhere();
Dataset<Row> table2 = getTable2FromSomewhere();

// Join the same lookup table twice, once per key column,
// giving each join its own alias so the columns can be told apart.
table1
    .as("table1")
    .join(table2.as("table2"),
            col("table1.first_key").equalTo(col("table2.key")), "left")
    .join(table2.as("table3"),
            col("table1.second_key").equalTo(col("table3.key")), "left")
    .select(col("table1.*"),
            col("table2.description").as("first_key_description"),
            col("table3.description").as("second_key_description"))
    .show();

Any help would be greatly appreciated on figuring out how to do this in the DataFrame API.

Upvotes: 2

Views: 1578

Answers (1)

user9969978

Reputation: 26

What I have not figured out is how to do this in the DataFrame API.

Because there is simply no DataFrame API that can express that directly (without an explicit join). That may change in the future.
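
One workaround, not part of this answer but a hedged sketch for the enum-like case in the question: if table2 is small, collect it into a map on the driver and wrap the lookup in a UDF, so the same "subquery" can be applied to any number of key columns without repeated joins. The column names and types (integer key, string description) are assumptions for illustration.

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Collect the small lookup table to the driver.
Map<Integer, String> lookup = new HashMap<>();
for (Row r : table2.collectAsList()) {
    lookup.put(r.getInt(r.fieldIndex("key")),
               r.getString(r.fieldIndex("description")));
}

// Register a serializable UDF that closes over the map
// (UDF1 extends Serializable, and HashMap is serializable).
spark.udf().register("describe",
    (UDF1<Integer, String>) lookup::get, DataTypes.StringType);

table1
    .withColumn("first_key_description", callUDF("describe", col("first_key")))
    .withColumn("second_key_description", callUDF("describe", col("second_key")))
    .show();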

Upvotes: 1
