Grayson

Reputation: 73

BigQuery UDF memory exceeded error on multiple rows but works fine on single row

I'm writing a UDF to process Google Analytics data, and getting the "UDF out of memory" error message when I try to process multiple rows. I downloaded the raw data and found the largest record and tried running my UDF query on that, with success. Some of the rows have up to 500 nested hits, and the size of the hit record (by far the largest component of each row of the raw GA data) does seem to have an effect on how many rows I can process before getting the error.

For example, the query

select
    user.ga_user_id,
    ga_session_id,
    ...
from
    temp_ga_processing(
        select
            fullVisitorId,
            visitNumber,
            ...
        from [79689075.ga_sessions_20160201] limit 100)

returns the error, but

from [79689075.ga_sessions_20160201] where totals.hits = 500 limit 1) 

does not.

I was under the impression that any memory limitations were per-row. I've tried several techniques, such as setting row = null; before emit(return_dict); (where return_dict is the processed data), but to no avail.
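
In sketch form, that attempt looks like this (heavily simplified; the real UDF builds return_dict from the row as shown below):

function temp_ga_processing(row, emit) {
  var return_dict = {};
  // ... populate return_dict from row and row.hits ...
  row = null;          // drop the reference to the input row before emitting
  emit(return_dict);
}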

The UDF itself doesn't do anything fancy; I'd paste it here but it's ~45 kB in length. It essentially does a bunch of things along the lines of:

function temp_ga_processing(row, emit) {
  topic_id = -1;
  hit_numbers = [];
  first_page_load_hits = [];
  return_dict = {};
  return_dict["user"] = {};
  return_dict["user"]["ga_user_id"] = row.fullVisitorId;
  return_dict["ga_session_id"] = row.fullVisitorId.concat("-".concat(row.visitNumber));
  for(i=0;i<row.hits.length;i++) {
    hit_dict = {};
    hit_dict["page"] = {};
    hit_dict["time"] = row.hits[i].time;
    hit_dict["type"] = row.hits[i].type;
    hit_dict["page"]["engaged_10s"] = false;
    hit_dict["page"]["engaged_30s"] = false;
    hit_dict["page"]["engaged_60s"] = false;

    add_hit = true;
    for(j=0;j<row.hits[i].customMetrics.length;j++) {
      if(row.hits[i].customDimensions[j] != null) {
        if(row.hits[i].customMetrics[j]["index"] == 3) {
          metrics = {"video_play_time": row.hits[i].customMetrics[j]["value"]};
          hit_dict["metrics"] = metrics;
          metrics = null;
          row.hits[i].customDimensions[j] = null;
        }
      }
    }

    hit_dict["topic"] = {};
    hit_dict["doctor"] = {};
    hit_dict["doctor_location"] = {};
    hit_dict["content"] = {};

    if(row.hits[i].customDimensions != null) {
      for(j=0;j<row.hits[i].customDimensions.length;j++) {
        if(row.hits[i].customDimensions[j] != null) {
          if(row.hits[i].customDimensions[j]["index"] == 1) {
            hit_dict["topic"] = {"name": row.hits[i].customDimensions[j]["value"]};
            row.hits[i].customDimensions[j] = null;
            continue;
          }
          if(row.hits[i].customDimensions[j]["index"] == 3) {
            if(row.hits[i].customDimensions[j]["value"].search("doctor") > -1) {
              return_dict["logged_in_as_doctor"] = true;
            }
          }
          // and so on...
        }
      }
    }
    if(row.hits[i]["eventInfo"]["eventCategory"] == "page load time" && row.hits[i]["eventInfo"]["eventLabel"].search("OUTLIER") == -1) {
      elre = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
      if(elre != null) {
        if(parseInt(elre[0].split(":")[1]) <= 60000) {
          first_page_load_hits.push(parseFloat(row.hits[i].hitNumber));
          if(hit_dict["page"]["page_load"] == null) {
            hit_dict["page"]["page_load"] = {};
          }
          hit_dict["page"]["page_load"]["sample"] = 1;
          page_load_time_re = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
          if(page_load_time_re != null) {
            hit_dict["page"]["page_load"]["page_load_time"] = parseFloat(page_load_time_re[0].split(':')[1])/1000;
          }
        }
        // and so on...  
      }
    }
  }
  row = null;
  emit(return_dict);
}

The job ID is realself-main:bquijob_4c30bd3d_152fbfcd7fd

Upvotes: 2

Views: 1850

Answers (3)

thomaspark

Reputation: 488

Update Aug 2016: We have pushed out an update that will allow the JavaScript worker to use twice as much RAM. We will continue to monitor jobs that have failed with JS OOM to see if more increases are necessary; in the meantime, please let us know if you have further jobs failing with OOM. Thanks!

Update: this issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment that was bigger than our limits, but this was only happening when the UDF runs over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.
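
For what it's worth, while we dig in, one thing that may shrink the amount of generated code in a large UDF like yours is hoisting repeated property chains (e.g. row.hits[i].customDimensions[j]) into locals. A rough, untested sketch of the idea, with extract_topics, dims, and dim as made-up names:

function extract_topics(row) {
  var topics = [];
  for (var i = 0; i < row.hits.length; i++) {
    var dims = row.hits[i].customDimensions || [];   // hoisted once per hit
    for (var j = 0; j < dims.length; j++) {
      var dim = dims[j];
      if (dim != null && dim.index == 1) {
        topics.push(dim.value);   // custom dimension index 1 holds the topic name in the question's UDF
      }
    }
  }
  return topics;
}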

@Grayson - I was able to run your job over the entire 20160201 table successfully; the query takes 1-2 minutes to execute. Could you please verify that this works on your side?


We've gotten a few reports of similar issues that seem related to the number of rows processed. I'm sorry for the trouble; I'll be doing some profiling on our JavaScript runtime to try to find if and where memory is being leaked. Stay tuned for the analysis.

In the meantime, if you're able to isolate any specific rows that cause the error, that would also be very helpful.

Upvotes: 1

Lloyd Tabb

Reputation: 86

I love the concept of parsing my logs in BigQuery, but I've got the same problem. I get

Error: Resources exceeded during query execution.

The Job Id is bigquery-looker:bquijob_260be029_153dd96cfdb, if that at all helps.

I wrote a very basic parser that does a simple match and returns rows. It works just fine on a 10K-row data set, but I run out of resources when trying to run it against a 3M-row log file.

Any suggestions for a work around?

Here is the JavaScript code.

function parseLogRow(row, emit) {
  // Reassemble the raw log line from the (up to) three input columns.
  var r = (row.logrow ? row.logrow : "") +
          (typeof row.l2 !== "undefined" ? " " + row.l2 : "") +
          (row.l3 ? " " + row.l3 : "");
  var ts = null;
  var category = null;
  var user = null;
  var message = null;
  var db = null;
  var found = false;

  if (r) {
    // Expected format: "YYYY-MM-DD hh:mm:ss.mmm +ZZZZ [category|user|db] :: message"
    var m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] :: (.*)/);
    if (m) {
      ts = new Date(m[1]) / 1000;  // seconds since the epoch for the timestamp column
      category = m[3] || null;
      user = m[4] || null;
      db = m[5] || null;
      message = m[6] || null;
      found = true;
    } else {
      message = r;
      found = false;
    }
  }

  emit({
    ts: ts,
    category: category,
    user: user,
    db: db,
    message: message,
    found: found
  });
}

bigquery.defineFunction(
  'parseLogRow',                           // Name of the function exported to SQL
  ['logrow',"l2","l3"],                    // Names of input columns
  [
    {'name': 'ts', 'type': 'timestamp'},  // Output schema
    {'name': 'category', 'type': 'string'},
    {'name': 'user', 'type': 'string'},
    {'name': 'db', 'type': 'string'},
    {'name': 'message', 'type': 'string'},
    {'name': 'found', 'type': 'boolean'},
  ],
  parseLogRow                          // Reference to JavaScript UDF
);

Upvotes: 0

user3688176

Reputation: 327

A UDF will fail on anything but very small datasets if it has a lot of nested if/then levels, such as:

if () {
    if () {
        if () {
            // etc.

We had to track down and remove the deepest if/then statement.
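
For illustration only, here is roughly what that kind of flattening looks like, using the customDimensions loop from the question's UDF as the "before" shape; markDoctorLogin, dims, and result are made-up names:

// Before: three nested if levels inside the loop (as in the question's UDF).
// After: guard clauses invert each condition and bail out early, so the body
// never nests more than one level deep.
function markDoctorLogin(dims, result) {
  for (var j = 0; j < dims.length; j++) {
    if (dims[j] == null) continue;                       // was: if (dims[j] != null) { ...
    if (dims[j].index != 3) continue;                     // was: if (... index == 3) { ...
    if (dims[j].value.search("doctor") == -1) continue;   // was: if (... search("doctor") > -1) { ...
    result.logged_in_as_doctor = true;
  }
}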

But that is not enough. In addition, when you pass the data into the UDF, run a "GROUP EACH BY" on all the variables. This will force BQ to send the output to multiple "workers"; otherwise it will also fail.

I've wasted 3 days of my life on this annoying bug. Argh.

Upvotes: 0
