Reputation: 618
A test that uses spark-sql
is failing due to the following error:
Correlated scalar subqueries must be Aggregated
The relevant part of the SQL query that is failing:
COALESCE(
CASE WHEN ndc_code IN
(SELECT CODE FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc')
THEN '180'
END,
CASE WHEN hcpcs_code IN
(SELECT CODE FROM BOC_MED_METADATA WHERE CODE_TYPE = 'hcpcs')
THEN '180'
END,
med_order_end_dttm,
CASE WHEN days_supply IS NOT NULL
THEN
CASE
WHEN discontinued AND (med_order_recorded_dttm BETWEEN medication_start AND start_plus)
THEN med_order_recorded_dttm
WHEN discontinued AND (med_order_recorded_dttm > start_plus)
THEN start_plus
WHEN NOT discontinued
THEN start_plus
END
END,
medication_start
)
The first two cases
in the coalesce
are what I added and caused the test to fail. I believe it is because of how the subqueries will return multiple rows but I'm not sure how to get around that. Any help is appreciated. If it makes a difference, this SQL
is only run in spark for the tests, it is actually run on a snowflake warehouse in production.
Upvotes: 1
Views: 2651
Reputation: 74669
I'm confused to see the error as I don't see how the subqueries could've been considered correlated. What are they correlated on?
Anyway, to get rid of the error (which I think happens at analysis time which is before the query gets executed) is to use any standard aggregate function that assure Spark Analyzer that the queries do return just a single row with a single column or no row at all (that would give null
as the result).
You could use first
or max
standard aggregate functions as follows:
SELECT first(CODE) FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc'
I don't think Spark should report the exception since I can't find any correlated queries and using SQL IN
should really accept any number of values from a subquery. I am confused.
Upvotes: 1
Reputation: 1460
Instead of using IN
, use EXISTS
in the first two CASE
statements. Try this:
COALESCE(
CASE WHEN EXISTS( SELECT 1 FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc' AND CODE = ndc_code)
THEN '180'
END,
CASE WHEN EXISTS( SELECT 1 FROM BOC_MED_METADATA WHERE CODE_TYPE = 'hcpcs' AND CODE = hcpcs_code)
THEN '180'
END,
med_order_end_dttm,
CASE WHEN days_supply IS NOT NULL
THEN
CASE
WHEN discontinued AND (med_order_recorded_dttm BETWEEN medication_start AND start_plus)
THEN med_order_recorded_dttm
WHEN discontinued AND (med_order_recorded_dttm > start_plus)
THEN start_plus
WHEN NOT discontinued
THEN start_plus
END
END,
medication_start
)
If the above fails, try qualifying ndc_code
and hcpcs_code
with the table name.
Upvotes: 0
Reputation: 1269873
If this code:
CASE WHEN ndc_code IN
(SELECT CODE FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc')
THEN '180'
Is generating an error of the form
Correlated scalar subqueries must be Aggregated
And the error message is accurate, then either CODE
or CODETYPE
is missing from the table. So, my first recommendation is to qualify all column names:
CASE WHEN ndc_code IN
(SELECT bmm.CODE FROM BOC_MED_METADATA bmm WHERE bmm.CODE_TYPE = 'ndc')
THEN '180'
This should return an error that one or the other columns are not found.
Upvotes: 0