Reputation: 425
I have two tables.
Table 1 is a single COLUMN of integers.
Table 2 has three COLUMNS : start_integer, end_integer, data
The simple query is to join the column of integers with data where
integer >= start_integer AND integer <= end_integer
In many SQL implementations this can be accomplished with a left conditional JOIN ... ON BETWEEN
SELECT tbl1.integer, tbl2.data FROM tbl1
LEFT JOIN tbl2 ON tbl1.integer BETWEEN tbl2.start_integer AND
tbl2.end_integer;
But it seems BigQuery supports only JOIN ON with only an = condition.
This could be accomplished with a cross join, but BigQuery complains that my tables are too big. CROSS JOIN EACH is invalid.
How can I accomplish this join task within the limitations of BigQuery's SQL?
Below is my BigQuery SQL:
SELECT tbl1.integer, tbl2.data
FROM bq:data.tbl1
CROSS JOIN bq:data.tbl2
WHERE tbl1.integer BETWEEN tbl2.start_integer AND tbl2.end_integer;
Which returns the error:
Error: 4.1 - 4.132: The JOIN operator's right-side table must be a small table. Switch the tables if the left-side table is smaller, or use JOIN EACH if both tables are larger than the maximum described at http://goo.gl/wXqgHs.
Upvotes: 5
Views: 15860
Reputation: 1558
Just adding outline of how i'm doing this problem - a little hacky but is quickest way i've found that scales well.
Input table looks like:
{
"ip": "130.211.149.140",
"ip_int": "2194904460",
"ip_part1": "130",
"ip_part2": "211",
"ip_part3": "149",
"ip_part4": "140",
"num_requests": "6811"
}
And lookup table is like:
{
"de_ip_key": "DE18_92.66.156.93_92.66.156.112",
"ip_key": "92.66.156.93_92.66.156.112",
"ip_from_int": "1547869277",
"ip_to_int": "1547869296",
"ip_from": "92.66.156.93",
"ip_to": "92.66.156.112",
"naics_code": "518210",
"ip_from_part1": "92",
"ip_from_part2": "66",
"ip_from_part3": "156",
"ip_from_part4": "93",
"ip_to_part1": "92",
"ip_to_part2": "66",
"ip_to_part3": "156",
"ip_to_part4": "112"
}
So using part 1 and part 2 of the ip address to join as a way of reducing the search space (the from and to ranges in my lookup table don't tend to span as wide as to have different part 1's and 2's - if so this approach fails).
select
ip,
ip_int,
-- pick first info from de
first(ip_key) as ip_key,
first(de_ip_key) as de_ip_key,
first(naics_code) as naics_code
from
(
select
ip as ip,
ip_int as ip_int,
ip_key as ip_key,
de_ip_key as de_ip_key,
naics_code as naics_code,
from
-- join based on part 1 and 2 of ip from range
(
select
input.ip as ip,
input.ip_int as ip_int,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.ip_key,null) as ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.de_ip_key,null) as de_ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.naics_code,null) as naics_code,
from
[ip.lookup_input_tbl] input
left outer join each
[digital_element.data_naics_code] de
on
input.ip_part1=de.ip_from_part1
and
input.ip_part2=de.ip_from_part2
group by 1,2,3,4,5
),
-- join based on part 1 and 2 of ip to range
(
select
input.ip as ip,
input.ip_int as ip_int,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.ip_key,null) as ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.de_ip_key,null) as de_ip_key,
if(input.ip_int between de.ip_from_int and de.ip_to_int,de.naics_code,null) as naics_code,
from
[ip.lookup_input_tbl] input
left outer join each
[digital_element.data_naics_code] de
on
input.ip_part1=de.ip_to_part1
and
input.ip_part2=de.ip_to_part2
group by 1,2,3,4,5
),
group by 1,2,3,4,5
-- order so null records from either join go to bottom and get left behind on the first group by
order by ip_int,ip_key desc
)
group by 1,2
So it basically blows out the data (by equality join on part 1 and part 2 of ip address and ip_from and ip_to addresses) and then reduces it on the group by using the if between statement (doing this instead of a where condition ensures you get a proper left outer join so you can also see which records you processed but that had no info in the lookup table).
Defo not the prettiest and probably one or two more ways to optimize it but is working for me now and looking up 500K input ip addresses against a lookup table of 16M records in 10-20 seconds.
Upvotes: 0
Reputation: 59165
Good news (2016)! BigQuery does support inequality joins now - make sure to uncheck the "use legacy SQL option".
Example query:
SELECT *
FROM (
SELECT 1 x
) a JOIN (
SELECT 2 y
) b
ON a.x<b.y
With legacy SQL:
Error: ON clause must be AND of = comparisons of one field name from each table, ...
With standard SQL:
1 2
Upvotes: 8
Reputation: 155
Did you try the below query:
SELECT tbl1.integer, tbl2.data
FROM bq:data.tbl1
JOIN EACH bq:data.tbl2
ON tbl1.integer >= tbl2.start_integer AND tbl1.integer <= tbl2.end_integer;
Upvotes: -2