asdasd32
asdasd32

Reputation: 77

The correlated scalar subquery can only contain equality predicates issue with spark sql

select *,
      (select sum(my_val) from my_data data1
      where my_date > date_sub(CAST(data2.start_date as DATE), 180)
      and my_date < date_sub(CAST(data2.start_date as DATE), 90)
      and data2.id = data1.id ) as changed_1,
      (select sum(my_val) from my_data data1
      where my_date > date_sub(CAST(data2.start_date as DATE), 160)
      and my_date < date_sub(CAST(data2.start_date as DATE), 150)
      and data2.id = data1.id ) as changed_2,
      (select sum(my_val) from my_data data1
      where my_date > date_sub(CAST(data2.start_date as DATE), 80)
      and my_date < date_sub(CAST(data2.start_date as DATE), 45)
      and data2.id = data1.id ) as changed_3
      ...<insert a lot of other aggregations based on different date ranges>...
from otherdata_data data2

I am trying to execute the above sql, however I get the following error

The correlated scalar subquery can only contain equality predicates

I have 50 or so aggregate sub queries all based on different parameters between my_data and otherdata_data each aggregation will be on a different date range and some other possible factors. Is it possible to do this in spark sql since it doesnt seem to like non-equalities in sub queries of a predicate?

Upvotes: 0

Views: 145

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269873

Use conditional aggregation:

select data1.id,
       sum(case when my_date > date_sub(CAST(data2.start_date as DATE), 180) and my_date < date_sub(CAST(data2.start_date as DATE), 90)
                then my_val
           end),
       sum(case when my_date > date_sub(CAST(data2.start_date as DATE), 160) and my_date < date_sub(CAST(data2.start_date as DATE), 150)
                then my_val
           end),
       . . . 
from data1
group by data1.id;

You can then join in other tables (perhaps using this as a subquery).

Upvotes: 1

eshirvana
eshirvana

Reputation: 24568

how about chnage your query to this:

select
    data2.*,
       sum(case when data1.my_date > data1.date_sub(CAST(data2.start_date as DATE), 180) and data1.my_date < data1.date_sub(CAST(data2.start_date as DATE), 90) then my_val end)  as changed_1,
       sum(case when data1.my_date > data1.date_sub(CAST(data2.start_date as DATE), 160) and data1.my_date < data1.date_sub(CAST(data2.start_date as DATE), 150) then my_val end) as changed_2,
       sum(case when data1.my_date > data1.date_sub(CAST(data2.start_date as DATE), 80) and data1.my_date < data1.date_sub(CAST(data2.start_date as DATE), 45) then my_val end) as changed_3
from
    otherdata_data data2
    join my_data data1
     on data2.id = data1.id
group by <all columns of otherdata_data table>

Upvotes: 1

Related Questions