Brett Bonner
Brett Bonner

Reputation: 425

conditional join in bigquery

I have two tables.

Table 1 is a single COLUMN of integers.

Table 2 has three COLUMNS : start_integer, end_integer, data

The simple query is to join the column of integers with data where

  integer >= start_integer AND integer <= end_integer

In many SQL implementations this can be accomplished with a left conditional JOIN ... ON BETWEEN

SELECT tbl1.integer, tbl2.data FROM tbl1
LEFT JOIN tbl2 ON tbl1.integer BETWEEN tbl2.start_integer AND 
tbl2.end_integer;

But it seems BigQuery supports only JOIN ON with only an = condition.

This could be accomplished with a cross join, but BigQuery complains that my tables are too big. CROSS JOIN EACH is invalid.

How can I accomplish this join task within the limitations of BigQuery's SQL?

Below is my BigQuery SQL:

SELECT tbl1.integer, tbl2.data
FROM bq:data.tbl1 
CROSS JOIN bq:data.tbl2
WHERE tbl1.integer BETWEEN tbl2.start_integer AND tbl2.end_integer;

Which returns the error:

Error: 4.1 - 4.132: The JOIN operator's right-side table must be a small table. Switch the tables if the left-side table is smaller, or use JOIN EACH if both tables are larger than the maximum described at http://goo.gl/wXqgHs.

Upvotes: 5

Views: 15860

Answers (4)

andrewm4894
andrewm4894

Reputation: 1558

Just adding outline of how i'm doing this problem - a little hacky but is quickest way i've found that scales well.

Input table looks like:

{
    "ip": "130.211.149.140",
    "ip_int": "2194904460",
    "ip_part1": "130",
    "ip_part2": "211",
    "ip_part3": "149",
    "ip_part4": "140",
    "num_requests": "6811"
  }

And lookup table is like:

{
    "de_ip_key": "DE18_92.66.156.93_92.66.156.112",
    "ip_key": "92.66.156.93_92.66.156.112",
    "ip_from_int": "1547869277",
    "ip_to_int": "1547869296",
    "ip_from": "92.66.156.93",
    "ip_to": "92.66.156.112",
    "naics_code": "518210",
    "ip_from_part1": "92",
    "ip_from_part2": "66",
    "ip_from_part3": "156",
    "ip_from_part4": "93",
    "ip_to_part1": "92",
    "ip_to_part2": "66",
    "ip_to_part3": "156",
    "ip_to_part4": "112"
  }

So using part 1 and part 2 of the ip address to join as a way of reducing the search space (the from and to ranges in my lookup table don't tend to span as wide as to have different part 1's and 2's - if so this approach fails).

select
  ip,
  ip_int,
  -- pick first info from de
  first(ip_key) as ip_key,
  first(de_ip_key) as de_ip_key,
  first(naics_code) as naics_code
from
  (
  select 
    ip as ip,
    ip_int as ip_int,
    ip_key as ip_key,
    de_ip_key as de_ip_key,
    naics_code as naics_code,
  from 
    -- join based on part 1 and 2 of ip from range
    (
    select 
      input.ip as ip,
      input.ip_int as ip_int,
      if(input.ip_int between de.ip_from_int and de.ip_to_int,de.ip_key,null) as ip_key,
      if(input.ip_int between de.ip_from_int and de.ip_to_int,de.de_ip_key,null) as de_ip_key,
      if(input.ip_int between de.ip_from_int and de.ip_to_int,de.naics_code,null) as naics_code,
    from
      [ip.lookup_input_tbl]  input
    left outer join each 
      [digital_element.data_naics_code] de
    on
      input.ip_part1=de.ip_from_part1
      and
      input.ip_part2=de.ip_from_part2
    group by 1,2,3,4,5
    ),
    -- join based on part 1 and 2 of ip to range
    (
    select 
      input.ip as ip,
      input.ip_int as ip_int,
      if(input.ip_int between de.ip_from_int and de.ip_to_int,de.ip_key,null) as ip_key,
      if(input.ip_int between de.ip_from_int and de.ip_to_int,de.de_ip_key,null) as de_ip_key,
      if(input.ip_int between de.ip_from_int and de.ip_to_int,de.naics_code,null) as naics_code,
    from
      [ip.lookup_input_tbl]  input
    left outer join each 
      [digital_element.data_naics_code] de
    on
      input.ip_part1=de.ip_to_part1
      and
      input.ip_part2=de.ip_to_part2
    group by 1,2,3,4,5
    ),
  group by 1,2,3,4,5
  -- order so null records from either join go to bottom and get left behind on the first group by
  order by ip_int,ip_key desc
  )
group by 1,2

So it basically blows out the data (by equality join on part 1 and part 2 of ip address and ip_from and ip_to addresses) and then reduces it on the group by using the if between statement (doing this instead of a where condition ensures you get a proper left outer join so you can also see which records you processed but that had no info in the lookup table).

Defo not the prettiest and probably one or two more ways to optimize it but is working for me now and looking up 500K input ip addresses against a lookup table of 16M records in 10-20 seconds.

Upvotes: 0

Felipe Hoffa
Felipe Hoffa

Reputation: 59165

Good news (2016)! BigQuery does support inequality joins now - make sure to uncheck the "use legacy SQL option".

Example query:

SELECT * 
FROM (
  SELECT 1 x
) a JOIN (
  SELECT 2 y
) b
ON a.x<b.y

With legacy SQL:

Error: ON clause must be AND of = comparisons of one field name from each table, ...

With standard SQL:

1     2

enter image description here

Upvotes: 8

Brett Bonner
Brett Bonner

Reputation: 425

BigQuery does not support cross joins on right-side tables.

Upvotes: 2

Cihan Fethi Hızar
Cihan Fethi Hızar

Reputation: 155

Did you try the below query:

SELECT tbl1.integer, tbl2.data
FROM bq:data.tbl1 
JOIN EACH bq:data.tbl2
ON tbl1.integer >= tbl2.start_integer AND tbl1.integer <= tbl2.end_integer;

Upvotes: -2

Related Questions