Reputation: 6644
Is there an explanation for why the below unhelpful error message displays the way that it does, or is there an otherwise documented best practice for leveraging potentially-missing data in subqueries?
BigQuery yields a confusing and unhelpful error when I reference a field from a table that is left-joined into a CTE.
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
My particular case is an attempt to calculate business days between two dates by leveraging a subquery against an existing dates
table.
The below query throws the following error when I introduce the subquery as a select
ed field.
It appears that the reference to d2
causes the query to break.
with
first_dates as (
select 1 as id, cast('2020-01-01' as date) as d
union all
select 2 as id, cast('2020-02-01' as date) as d
),
second_dates as (
select 1 as id, cast('2020-01-02' as date) as d
),
timelines as (
select
first_dates.id,
first_dates.d as d1,
second_dates.d as d2
from
first_dates
-- This is the only left join in my query, and looks fine to me!
left join second_dates using (id)
)
select
*,
(
select count(*)
from <my_example_dates_table>
where
date >= d1
and date < d2
and is_business_day
) as n_business_days
from
timelines
My current hypothesis is that BigQuery's backend is converting the subquery into a left join and encountering an error when it attempts to do so. If so, I believe the error message should be improved.
I've found many SO links referencing this same issue, but I haven't found any answers that explain why exactly this error arises, why the error message is so unhelpful, and whether there is a best practice for recreating what seems like a reasonable query pattern in BigQuery.
Therefore, I implore the person who will inevitably want to close this as a duplicate to leave it open, since it's specifically asking for an answer as to why the error message is displayed like it is.
Many are answered already suggesting specific alternative approaches to particular problems:
Upvotes: 1
Views: 919
Reputation: 469
Besides @rtenha comment. This is a known issue with Bigquery and can be tracked in here
Also, in these situation is suggested to change <= or >= to =.
Additionally, taking this query as example:
SELECT *, ( SELECT MIN(chosen) FROM chosen_t WHERE chosen >= requested) AS rounded_v FROM tots)
A workaround suggested by the BigQuery Team, is adding fake equality join predicate as this:
SELECT *, ( SELECT MIN(chosen) FROM chosen_t WHERE chosen >= requested and 0 * chosen = 0 * requested) AS rounded_v
Upvotes: 2
Reputation: 3616
First, consider this query (simplified from your query):
with
first_dates as (
select 1 as id, cast('2020-01-01' as date) as d
union all
select 2 as id, cast('2020-02-01' as date) as d
)
select
*,
(
select count(*)
from utility.calendar -- just a simple calendar
where
calendar_date = d
) as n_business_days
from
first_dates
This executes, and if you look at the 'Execution Details' in the BQ interface, you can indeed see a join happening. Your hypothesis here seems to be correct.
If you change the subquery to calendar_date >= d
, then you get the LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join
error. Assuming the subquery-to-left-join transformation actually is occurring, then the question becomes 'Why do I need an equality in my left join?'
I do not know the exact reasons, but my hunch is it has something to do with scalability and query cost estimation. If your right-hand-side table is partitioned, an inequality (>
,<
,etc) join would not benefit from the partitioning as each row on the left-hand-side would have to compare to the entire right-hand-side.
The solution here is to usually CROSS JOIN
and then filter. This moves any comparisons from the join
clause to the where
clause.
select
id, d1, d2, count(*) as ct
from timelines
cross join <my_example_dates_table>
where date >= d1
and date < d2
and is_business_day
group by 1,2,3
Upvotes: 2