Reputation: 321
I have two tables, and for each region in A, I want to find the closest regions in B.
A:
------------------------
ID | Start | End | Color
------------------------
1 | 400 | 500 | White
------------------------
1 | 10 | 20 | Red
------------------------
2 | 2 | 10 | Blue
------------------------
4 | 88 | 90 | Color
------------------------
B:
------------------------
ID | Start | End | Name
------------------------
1 | 1 | 2 | XYZ1
------------------------
1 | 50 | 60 | XYZ4
------------------------
2 | 150 | 160 | ABC1
------------------------
2 | 50 | 60 | ABC2
------------------------
4 | 100 | 120 | EFG
------------------------
RS:
---------------------------------------
ID | Start | End | Color | Closest Name
---------------------------------------
1 | 400 | 500 | White | XYZ4
---------------------------------------
1 | 10 | 20 | Red | XYZ1
---------------------------------------
2 | 2 | 10 | Blue | ABC2
---------------------------------------
4 | 88 | 90 | Color | EFG
---------------------------------------
Currently, I first find min distance by joining two tables:
MinDist Table:
SELECT A.ID, A.Start, A.End,
MIN(CASE
WHEN (ABS(A.End-B.Start)>=ABS(A.Start - B.End))
THEN ABS(A.Start-B.End)
ELSE ABS(A.End - B.Start)
END) AS distance
FROM ( Select A ... )
Join B On A.ID=B.ID)
Group By A.ID, A.Start, A.End
Then recompute distance for by joining table A and B again, GlobDist Table (Note, the query retrieves B.Name in this case):
SELECT A.ID, A.Start, A.End,
CASE
WHEN (ABS(A.End-B.Start)>=ABS(A.Start - B.End))
THEN ABS(A.Start-B.End)
ELSE ABS(A.End - B.Start)
END AS distance,
B.Name
FROM ( Select A ... )
Join B On A.ID=B.ID)
Finally join these two tables MinDist and GlobDist Tables on
GlobDist.ID= MinDist.ID,
GlobDist.Start=MinDist.Start,
GlobDist.End= MinDist.End,
GlobDist.distance= MinDist.distance.
I tested ROW_NUMBER() and PARTITION BY over (ID, Start, End), but it took much longer. So, what's the fastest and most efficient way of solving this problem? How can I reduce duplicate computation?
Thanks!
Upvotes: 2
Views: 609
Reputation: 172974
Below solution is for BigQuery Standard SQL and as simple and short as below
#standardSQL
SELECT a_id, a_start, a_end, color,
ARRAY_AGG(name ORDER BY POW(ABS(a_start - b_start), 2) + POW(ABS(a_end - b_end), 2) LIMIT 1)[SAFE_OFFSET(0)] name
FROM A JOIN B ON a_id = b_id
GROUP BY a_id, a_start, a_end, color
-- ORDER BY a_id
You can test / play with above using dummy data in your question
#standardSQL
WITH A AS (
SELECT 1 a_id, 400 a_start, 500 a_end, 'White' color UNION ALL
SELECT 1, 10, 20 , 'Red' UNION ALL
SELECT 2, 2, 10, 'Blue' UNION ALL
SELECT 4, 88, 90, 'Color'
), B AS (
SELECT 1 b_id, 1 b_start, 2 b_end, 'XYZ1' name UNION ALL
SELECT 1, 50, 60, 'XYZ4' UNION ALL
SELECT 2, 150, 160,'ABC1' UNION ALL
SELECT 2, 50, 60, 'ABC2' UNION ALL
SELECT 4, 100, 120,'EFG'
)
SELECT a_id, a_start, a_end, color,
ARRAY_AGG(name ORDER BY POW(ABS(a_start - b_start), 2) + POW(ABS(a_end - b_end), 2) LIMIT 1)[SAFE_OFFSET(0)] name
FROM A JOIN B ON a_id = b_id
GROUP BY a_id, a_start, a_end, color
ORDER BY a_id
with result as below
Row a_id a_start a_end color name
1 1 400 500 White XYZ4
2 1 10 20 Red XYZ1
3 2 2 10 Blue ABC2
4 4 88 90 Color EFG
Upvotes: 2