AmirCS
AmirCS

Reputation: 331

Avoid Duplicate Computation - BigQuery

I have two tables, and for each region in A, I want to find the closest regions in B.

Table A:

------------------------
ID | Start | End | Color 
------------------------
 1 |  400  | 500 | White
------------------------
 1 |  10   | 20  | Red 
------------------------
 2 |   2   |  10 | Blue 
------------------------
 4 |   88  |  90 | Color 
------------------------

Table B:

-------------------------------
ID | Start | End | Name | Name2 
-------------------------------
 1 |  1    | 2   | XYZ1 | EWQ
-------------------------------
 1 |  50   | 60  | XYZ4 | EWY
-------------------------------
 2 |  150  | 160 | ABC1 | TRE
-------------------------------
 2 |  50   | 60  | ABC2 | YUE
-------------------------------
 4 |  100  | 120 | EFG  | MMN
-------------------------------

Here is the result table:

-------------------------------------------------------
ID | Start | End | Color | Closest Name | Closest Name2
-------------------------------------------------------
 1 |  400  | 500 | White |   XYZ4       |   EWY
-------------------------------------------------------
 1 |  10   | 20  | Red   |   XYZ1       |  EWQ
-------------------------------------------------------
 2 |   2   |  10 | Blue  |   ABC2       |  YUE
-------------------------------------------------------
 4 |   88  |  90 | Color |   EFG        |  MMN
-------------------------------------------------------

Here is the current solution:

#standardSQL
SELECT
  A.ID,
  A.Start,
  A.END,
  ARRAY_AGG(B.name
  ORDER BY
   (CASE WHEN (ABS(A.END-B.Start) >= ABS(A.Start - B.END)) THEN ABS(A.Start-B.END) ELSE ABS(A.END-B.Start) END)  
  LIMIT
    1)[SAFE_OFFSET(0)] name,
  ARRAY_AGG(B.name2
  ORDER BY
   (CASE WHEN (ABS(A.END-B.Start) >= ABS(A.Start - B.END)) THEN ABS(A.Start-B.END) ELSE ABS(A.END-B.Start) END)  
  LIMIT
    1)[SAFE_OFFSET(0)] name2

FROM
     A
JOIN
     B
ON
  A.ID = B.ID

  WHERE  (A.start>B.End) OR (B.Start> A.END)
GROUP BY
  A.ID,
  A.start,
  A.END

In this case, we have only two fields (name and name2); if B has N fields, then is there any way that we can avoid duplicate computation?

Thanks!

Upvotes: 1

Views: 81

Answers (3)

Daria
Daria

Reputation: 606

Make a struct:

ARRAY_AGG((B.name, B.name2)
  ORDER BY
   (CASE WHEN (ABS(A.END-B.Start) >= ABS(A.Start - B.END)) THEN ABS(A.Start-B.END) ELSE ABS(A.END-B.Start) END)  
  LIMIT
    1)[SAFE_OFFSET(0)] names,

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173181

below should give you an idea

#standardSQL
WITH A AS (
  SELECT 1 a_id, 400 a_start, 500 a_end, 'White' color UNION ALL
  SELECT 1, 10,  20  , 'Red' UNION ALL
  SELECT 2, 2,   10, 'Blue' UNION ALL
  SELECT 4, 88,  90, 'Color'
), B AS (
  SELECT 1 b_id, 1 b_start, 2 b_end, 'XYZ1' name, 'EWQ' name2 UNION ALL
  SELECT 1, 50, 60,  'XYZ4', 'EWY' UNION ALL
  SELECT 2, 150, 160,'ABC1', 'TRE' UNION ALL
  SELECT 2, 50, 60,  'ABC2', 'YUE' UNION ALL
  SELECT 4, 100, 120,'EFG', 'MMN'
)
SELECT 
  a_id, a_start, a_end, color, names.name, names.name2
FROM (
  SELECT a_id, a_start, a_end, color,  
    ARRAY_AGG(STRUCT(name, name2) ORDER BY POW(ABS(a_start - b_start), 2) + POW(ABS(a_end - b_end), 2) LIMIT 1)[SAFE_OFFSET(0)] names
  FROM A JOIN B ON a_id = b_id
  GROUP BY a_id, a_start, a_end, color
)
ORDER BY a_id  

with result as

Row a_id    a_start a_end   color   name    name2    
1   1       400     500     White   XYZ4    EWY  
2   1       10      20      Red     XYZ1    EWQ  
3   2       2       10      Blue    ABC2    YUE  
4   4       88      90      Color   EFG     MMN  

Upvotes: 2

Elliott Brossard
Elliott Brossard

Reputation: 33765

You should be able to use ARRAY_AGG with a STRUCT instead. Here are few sample expressions:

ARRAY_AGG(
  B
  ORDER BY
   (CASE WHEN (ABS(A.END-B.Start) >= ABS(A.Start - B.END))
      THEN ABS(A.Start-B.END)
      ELSE ABS(A.END-B.Start)
    END)  
  LIMIT
    1)[SAFE_OFFSET(0)].*

This returns all of the fields inside B for the first instance of B based on the ordering.

ARRAY_AGG(
  (SELECT AS STRUCT B.* EXCEPT(foo, bar))
  ORDER BY
   (CASE WHEN (ABS(A.END-B.Start) >= ABS(A.Start - B.END))
      THEN ABS(A.Start-B.END)
      ELSE ABS(A.END-B.Start)
    END)  
  LIMIT
    1)[SAFE_OFFSET(0)].*

This returns all fields inside B except foo and bar (you can replace these names with whatever you want to exclude).

ARRAY_AGG(
  STRUCT(B.name, B.name2, B.foo, B.bar)
  ORDER BY
   (CASE WHEN (ABS(A.END-B.Start) >= ABS(A.Start - B.END))
      THEN ABS(A.Start-B.END)
      ELSE ABS(A.END-B.Start)
    END)  
  LIMIT
    1)[SAFE_OFFSET(0)].*

This returns just the named fields from B. You can list whichever ones you want.

Upvotes: 1

Related Questions