Andrew Backes

Reputation: 101

BigQuery joins on nested repeated fields

I am having trouble joining on a repeated nested field while still preserving the original row structure in BigQuery.

For my example I'll call the two tables being joined A and B.

Records in table A look something like:

{
  "url":"some url",
  "repeated_nested": [
    {"key":"some key","property":"some property"}
  ]
}

and records in table B look something like:

{
  "key":"some key",
  "property2": "another property"
}

I am hoping to find a way to join this data together to generate a row that looks like:

{
  "url":"some url",
  "repeated_nested": [
    {
      "key":"some key",
      "property":"some property",
      "property2":"another property"
    }
  ]
}

The very first query I tried was:

SELECT 
  url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM A
  AS lefttable
LEFT OUTER JOIN B
  AS righttable
  ON lefttable.key=righttable.key

This doesn't work because BQ can't join on repeated nested fields, and there is not a unique identifier for each row. If I were to FLATTEN on repeated_nested, I'm not sure how to put the original rows back together correctly afterwards.
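For reference, the flattened version of the join would look something like this in legacy SQL (using the A and B names from above; flat and mapping are just aliases I'm introducing here):

// one output row per element of repeated_nested
SELECT
  flat.url AS url,
  flat.key AS key,
  flat.property AS property,
  mapping.property2 AS property2
FROM (
  SELECT url, repeated_nested.key AS key, repeated_nested.property AS property
  FROM FLATTEN(A, repeated_nested)
) AS flat
LEFT OUTER JOIN B AS mapping
  ON flat.key = mapping.key

That gives the joined values, but as one flat row per repeated_nested element; re-nesting those rows into the original structure is the part that's missing.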

The data is such that a given url always appears with the same repeated_nested field. Because of that, I was able to work around the problem by using a UDF to roll this repeated nested object up into a JSON string and then unroll it again:

SELECT url, repeated_nested.key, repeated_nested.property, repeated_nested.property2

FROM
JS(
   (
    SELECT basetable.url as url, companytable.repeated_nested_json
    FROM A as basetable

    LEFT JOIN (
      SELECT url, CONCAT("[", GROUP_CONCAT_UNQUOTED(repeated_nested_json, ","), "]") as repeated_nested_json
      FROM
      (
        SELECT
          url,
          CONCAT(
            '{"key": "', repeated_nested.key, '",',
            ' "property": "', repeated_nested.property, '",',
            ' "property2": "', mapping_table.property2, '"',
            '}'
          ) as repeated_nested_json
        FROM (
          SELECT
            url, repeated_nested.key, repeated_nested.property
          FROM A
          GROUP BY url, repeated_nested.key, repeated_nested.property
        ) as urltable

         LEFT OUTER JOIN B
            AS mapping_table
            ON urltable.repeated_nested.key=mapping_table.key
      )
      GROUP BY url
    ) as companytable
    ON basetable.url = companytable.url
  ),

  // input columns:
  url, repeated_nested_json,

  // output schema:
  "[{'name': 'url', 'type': 'string'},
   {'name': 'repeated_nested', 'type': 'RECORD', 'mode':'REPEATED', 'fields':
    [ { 'name': 'key', 'type':'string' },
      { 'name': 'property', 'type':'string' },
      { 'name': 'property2', 'type':'string' }]
   }]",

  // UDF:
  "function(row, emit) {
    var parsed_repeated_nested = [];
    try {
      if ( row.repeated_nested_json != null ) {
          parsed_repeated_nested = JSON.parse(row.repeated_nested_json);
      }
    } catch (ex) { /* fall back to an empty list if the JSON string is malformed */ }

    emit({
      url: row.url,
      repeated_nested: parsed_repeated_nested
    });
  }"
)

This solution works fine for small tables. But the real-life tables I'm working with have many more columns than in my example above. When there are other fields in addition to url and repeated_nested_json, they all have to be passed through the UDF. When I work with tables in the 50 GB range, everything is fine. But when I apply the UDF and query to tables that are 500-1000 GB, I get an Internal Server Error from BQ.

In the end I just need all of the data in newline-delimited JSON format in GCS. As a last-ditch effort, I tried concatenating all of the fields into a JSON string (so that I only had one column) in the hope that I could export it as CSV and have what I need. However, the export process escaped the double quotes and added double quotes around the JSON string. According to the BQ docs on jobs (https://cloud.google.com/bigquery/docs/reference/v2/jobs) there is a property configuration.query.tableDefinitions.(key).csvOptions.quote that could help me, but I can't figure out how to make it work.

Does anybody have advice on how they have dealt with this sort of situation?

Upvotes: 10

Views: 4869

Answers (2)

thomaspark

Reputation: 488

@AndrewBackes - we rolled out some fixes for UDF memory-related issues this week; there are some details on the root cause here https://stackoverflow.com/a/36275462/5265394 and here https://stackoverflow.com/a/35563562/5265394.

The UDF version of your query is now working for me; could you verify on your side?

Upvotes: 0

oulenz

Reputation: 1244

I have never had to do this, but you should be able to use FLATTEN, then JOIN, and then use NEST to get repeated fields again.

The docs state that BigQuery always flattens query results, but that turns out not to be the case: you can choose not to have results flattened if you set a destination table. You should then be able to export that table as JSON to Storage.

See also this answer for how to get NEST to work; a rough sketch follows.
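For illustration, a minimal legacy-SQL sketch of that NEST step; flattened_joined is a hypothetical intermediate table holding the flattened join of A and B (url, key, property, property2):

// NEST only produces repeated output when results are not flattened,
// so set a destination table and uncheck "Flatten Results"
SELECT
  url,
  NEST(key) AS keys,
  NEST(property) AS properties,
  NEST(property2) AS property2s
FROM flattened_joined
GROUP BY url

Note that NEST aggregates scalar values into separate repeated fields rather than into a single repeated record, so this only approximates the original repeated_nested structure.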

Upvotes: 1
