mravey
mravey

Reputation: 4510

Get Most Recent Column Value With Nested And Repeated Fields

I have a table with the following structure:
enter image description here

and the following data in it:

[
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:25 UTC"
  },
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:16 UTC"
  }
]

What I'd like to do is recreate the same table (same structure) but with only the latest version of a row. In this example let's say that I'd like to group by everything by name and take the row with the most recent createdAt. I tried to do something like this: Google Big Query SQL - Get Most Recent Column Value but I couldn't get it to work with record and repeated fields.

Upvotes: 2

Views: 1557

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173171

I really hoped someone from Google Team will provide answer on this question as it is very frequent topic/problem asked here on SO. BigQuery definitelly not friendly enough with writing Nested / Repeated stuff back to BQ off of BQ query.

So, I will provide the workaround I found relatively long time ago. I DO NOT like it, but (and that is why I hoped for the answer from Google Team) it works. I hope you will be able to adopt it for you particular scenario

So, based on your example, assume you have table as below

enter image description here

and you expect to get most recent records based on createdAt column, so result will look like:

enter image description here

Below code does this:

SELECT name, age, createdAt, addresses.city
FROM JS( 
  ( // input table 
    SELECT name, age, createdAt, NEST(city) AS addresses 
    FROM (
      SELECT name, age, createdAt, addresses.city 
      FROM (
        SELECT 
          name, age, createdAt, addresses.city, 
          MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
        FROM yourTable
      )
      WHERE createdAt = lastAt
    )
    GROUP BY name, age, createdAt
  ), 
  name, age, createdAt, addresses, // input columns 
  "[ // output schema 
    {'name': 'name', 'type': 'STRING'},
    {'name': 'age', 'type': 'INTEGER'},
    {'name': 'createdAt', 'type': 'INTEGER'},
    {'name': 'addresses', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'city', 'type': 'STRING'}
       ]    
     }
  ]", 
  "function(row, emit) { // function 
    var c = []; 
    for (var i = 0; i < row.addresses.length; i++) { 
      c.push({city:row.addresses[i]});
    }; 
    emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c}); 
  }"
) 

the way above code works is: it implicitely flattens original records; find rows that belong to most recent records (partitioned by name and age); assembles those rows back into respective records. final step is processing with JS UDF to build proper schema that can be actually written back to BigQuery Table as nested/repeated vs flatten

The last step is the most annoying part of this workaround as it needs to be customized each time for specific schema(s)

Please note, in this example - it is only one nested field inside addresses record, so NEST() fuction worked. In scenarious when you have more than just one field inside - above approach still works, but you need to involve concatenation of those fields to put them inside nest() and than inside js function to do extra splitting those fields, etc.
You can see examples in below answers:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?

I hope this is good foundation for you to experiment with and make your case work!

Upvotes: 3

Related Questions