ryantuck
ryantuck

Reputation: 6644

Why does BigQuery complain about a left join error when I introduce a subquery?

Question

Is there an explanation for why the below unhelpful error message displays the way that it does, or is there an otherwise documented best practice for leveraging potentially-missing data in subqueries?

Issue

BigQuery yields a confusing and unhelpful error when I reference a field from a table that is left-joined into a CTE.

Error message

LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.

Particular goal

My particular case is an attempt to calculate business days between two dates by leveraging a subquery against an existing dates table.

Reproducible example

The below query throws the following error when I introduce the subquery as a selected field.

It appears that the reference to d2 causes the query to break.

with

first_dates as (
  select 1 as id, cast('2020-01-01' as date) as d
  union all
  select 2 as id, cast('2020-02-01' as date) as d
),

second_dates as (
  select 1 as id, cast('2020-01-02' as date) as d
),


timelines as (
  select 
    first_dates.id,
    first_dates.d as d1,
    second_dates.d as d2
  from
    first_dates
    -- This is the only left join in my query, and looks fine to me!
    left join second_dates using (id)
)

select 
  *,
  (
    select count(*)
    from <my_example_dates_table>
    where 
      date >= d1 
      and date < d2
      and is_business_day
      ) as n_business_days
from 
  timelines

Hypothesis

My current hypothesis is that BigQuery's backend is converting the subquery into a left join and encountering an error when it attempts to do so. If so, I believe the error message should be improved.

Research so far

I've found many SO links referencing this same issue, but I haven't found any answers that explain why exactly this error arises, why the error message is so unhelpful, and whether there is a best practice for recreating what seems like a reasonable query pattern in BigQuery.

Therefore, I implore the person who will inevitably want to close this as a duplicate to leave it open, since it's specifically asking for an answer as to why the error message is displayed like it is.

Many are answered already suggesting specific alternative approaches to particular problems:

Upvotes: 1

Views: 919

Answers (2)

ebeltran
ebeltran

Reputation: 469

Besides @rtenha comment. This is a known issue with Bigquery and can be tracked in here

Also, in these situation is suggested to change <= or >= to =.

Additionally, taking this query as example:

SELECT *, ( SELECT MIN(chosen) FROM chosen_t WHERE chosen >= requested) AS rounded_v FROM tots)

A workaround suggested by the BigQuery Team, is adding fake equality join predicate as this:

SELECT *, ( SELECT MIN(chosen) FROM chosen_t WHERE chosen >= requested and 0 * chosen = 0 * requested) AS rounded_v

Upvotes: 2

rtenha
rtenha

Reputation: 3616

First, consider this query (simplified from your query):

with

first_dates as (
  select 1 as id, cast('2020-01-01' as date) as d
  union all
  select 2 as id, cast('2020-02-01' as date) as d
)

select 
  *,
  (
    select count(*)
    from utility.calendar -- just a simple calendar
    where 
      calendar_date = d
      ) as n_business_days
from 
  first_dates

This executes, and if you look at the 'Execution Details' in the BQ interface, you can indeed see a join happening. Your hypothesis here seems to be correct.

If you change the subquery to calendar_date >= d, then you get the LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join error. Assuming the subquery-to-left-join transformation actually is occurring, then the question becomes 'Why do I need an equality in my left join?'

I do not know the exact reasons, but my hunch is it has something to do with scalability and query cost estimation. If your right-hand-side table is partitioned, an inequality (>,<,etc) join would not benefit from the partitioning as each row on the left-hand-side would have to compare to the entire right-hand-side.

The solution here is to usually CROSS JOIN and then filter. This moves any comparisons from the join clause to the where clause.

select
  id, d1, d2, count(*) as ct
from timelines
cross join <my_example_dates_table>
where date >= d1 
  and date < d2
  and is_business_day
group by 1,2,3

Upvotes: 2

Related Questions