BIGQUERY - Query Exceeded resource limit

Question

I am running the below query to join the two tables and get certain records based on Fuzzy logic (Levenshtein distance)

WITH main_table as (
  select *
from 
`project.data.Roof_Address`
), reference_table as (
  select *
from `project.data.DATA_TREE_Address` 
)
select
  DR_NBR,
  ARRAY_AGG(
    STRUCT(n.LotSizeSqFt)
    ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)].*,
  ARRAY_AGG(
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname)  LIMIT 1
  )[OFFSET(0)] distance_score
FROM main_table l
CROSS JOIN reference_table n

GROUP BY 1
having ARRAY_AGG(
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname)  LIMIT 1
  )[OFFSET(0)] < 10

This query will return the

Project_Id(Dr_NBR)

from first table and

Project_area(LotSizeSqFt)

from second table based on the Levenshtein Score filter at the end.

This query is resulting in the below error

Any suggestions how to optimize the above query?

The distance I am using is from the below function

#standardSQL
CREATE TEMPORARY FUNCTION EDIT_DISTANCE(string1 STRING, string2 STRING)
RETURNS INT64
LANGUAGE js AS """
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_string1;

  try {
    the_string1 = decodeURI(string1).toLowerCase();
  } catch (ex) {
    the_string1 = string1.toLowerCase();
  }

  try {
    the_string2 = decodeURI(string2).toLowerCase();
  } catch (ex) {
    the_string2 = string2.toLowerCase();
  }

  return Levenshtein.get(the_string1, the_string2) 

""";

Snapshot for Roof_Address table

Snapshot for DATA_TREE_Address

Adrian White · Accepted Answer

The main query cost would most likely be the ORDER by in the :

ARRAY_AGG(
    STRUCT(n.LotSizeSqFt)
    ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)].*,

I see you're only returning a single record for each array_agg.

I'd recommend removing the ARRAY_AGG and do a MAX or MIN on the results from the EDIT_DISTANCE. A MAX or MIN is much much cheaper than ORDERING ALL records and taking the first or last one.

BIGQUERY - Query Exceeded resource limit

Answers (1)

Related Questions