shivanshu
shivanshu

Reputation: 67

BIGQUERY - Query Exceeded resource limit

I am running the below query to join the two tables and get certain records based on Fuzzy logic (Levenshtein distance)

WITH main_table as (
  select *
from 
`project.data.Roof_Address`
), reference_table as (
  select *
from `project.data.DATA_TREE_Address` 
)
select
  DR_NBR,
  ARRAY_AGG(
    STRUCT(n.LotSizeSqFt)
    ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)].*,
  ARRAY_AGG(
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname)  LIMIT 1
  )[OFFSET(0)] distance_score
FROM main_table l
CROSS JOIN reference_table n

GROUP BY 1
having ARRAY_AGG(
    EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname)  LIMIT 1
  )[OFFSET(0)] < 10

This query will return the

Project_Id(Dr_NBR)

from first table and

Project_area(LotSizeSqFt)

from second table based on the Levenshtein Score filter at the end.

This query is resulting in the below error

enter image description here

Any suggestions how to optimize the above query?

The distance I am using is from the below function

#standardSQL
CREATE TEMPORARY FUNCTION EDIT_DISTANCE(string1 STRING, string2 STRING)
RETURNS INT64
LANGUAGE js AS """
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_string1;

  try {
    the_string1 = decodeURI(string1).toLowerCase();
  } catch (ex) {
    the_string1 = string1.toLowerCase();
  }

  try {
    the_string2 = decodeURI(string2).toLowerCase();
  } catch (ex) {
    the_string2 = string2.toLowerCase();
  }

  return Levenshtein.get(the_string1, the_string2) 

""";

Snapshot for Roof_Address table

enter image description here

Snapshot for DATA_TREE_Address

enter image description here

Upvotes: 1

Views: 315

Answers (1)

Adrian White
Adrian White

Reputation: 1804

The main query cost would most likely be the ORDER by in the :

ARRAY_AGG(
    STRUCT(n.LotSizeSqFt)
    ORDER BY EDIT_DISTANCE(l.ordered_fullname, n.ordered_fullname) LIMIT 1
  )[OFFSET(0)].*,

I see you're only returning a single record for each array_agg.

I'd recommend removing the ARRAY_AGG and do a MAX or MIN on the results from the EDIT_DISTANCE. A MAX or MIN is much much cheaper than ORDERING ALL records and taking the first or last one.

Upvotes: 1

Related Questions